An FPGA Embedded Microcontroller


Microprocessors and Microsystems 38 (2014) 1–8

Contents lists available at ScienceDirect

Microprocessors and Microsystems

journal homepage: www.elsevier.com/locate/micpro

An FPGA embedded microcontroller

0141-9331/$ - see front matter © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.micpro.2013.10.004

⇑ Tel.: +48 17 865 12 25; fax: +48 17 854 29 10. E-mail address: [email protected]

Zbigniew Hajduk ⇑ Rzeszów University of Technology, ul. Powstańców Warszawy 12, 35-959 Rzeszów, Poland

Article info

Article history: Available online 8 November 2013

Keywords: Field Programmable Gate Arrays; Embedded processors

Abstract

The paper presents the design of an 8-bit RISC microcontroller, which is mainly targeted at performing non-timing-critical functions inside FPGAs. The microcontroller is based on the popular Microchip PIC16 microcontroller family. Its main feature is that it is 4 times faster for regular instructions, and 8 times faster for instructions which modify the program counter, than its Microchip archetype clocked at the same frequency. Three versions of the microcontroller instruction cycle structure have been considered, and performance tests of the versions have been carried out. The paper also describes two sample applications which illustrate the usefulness of the microcontroller and show that, with the FPGA embedded microcontroller, some functions can be realized more simply and quickly than with a typical FPGA design flow without the microcontroller. To facilitate frequent exchange of the microcontroller program memory content, especially at the software development stage, a downloader module is also proposed. The downloader allows the compiler's HEX output file to be loaded directly into the program memory through a generic serial interface.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

Microprocessors and microcontrollers embedded within Field Programmable Gate Arrays (FPGAs) combine the best features of both kinds of digital devices [1,2]. Programming control sequences or state machines for a microcontroller in assembly or a high-level language is often easier than creating similar structures in FPGA logic [3,4]. Therefore, the embedded microcontroller can implement complex, non-timing-critical control functions, while timing-critical or data path functions are best implemented in FPGA logic. Customization is another advantage of embedded microcontrollers; e.g. specific peripherals can be connected directly to the microcontroller buses [2].

Several specific microprocessor designs have been described in scientific papers. For example, a Forth core which executes instructions of the Forth programming language is presented in [5]. A high-speed microprocessor with a DSP extension, optimized for the Virtex-4 FPGA, is described in [6]. An application-specific instruction set processor for signal detection in wireless MIMO systems is shown in [7]. FPGA vendors also offer 8- and 32-bit microprocessor intellectual property (IP) cores. Well-known architectures are Xilinx PicoBlaze [3] and MicroBlaze, Lattice Mico8 [8] and Mico32, and Altera Nios II [9]. However, due to technical and/or licensing issues, microprocessor IP cores from one vendor generally cannot be used with an FPGA from another vendor. Therefore, the designer has to learn different microprocessor architectures, instruction sets and development tools when using FPGAs from different vendors.

This paper presents the design of a relatively simple, general-purpose 8-bit RISC microcontroller, specially targeted at performing non-timing-critical functions inside an FPGA. The general architecture of the microcontroller is based on the popular Microchip PIC16F87x family [10]. Its important feature is that it is 4 times faster (8 times faster for instructions which modify the program counter) than its Microchip archetype at the same clock frequency. The microcontroller is described in Verilog HDL and can be implemented in FPGAs from any vendor. Owing to its speed, its small number of instructions (only 33 instructions are available), its customizability (the availability of the HDL source description means that the designer can easily upgrade, modify and adjust the microcontroller) and the wide range of software development tools, including C compilers, the microcontroller can be an attractive alternative to other 8-bit microcontrollers, such as Xilinx PicoBlaze and Lattice Mico8.

The paper is organized as follows. Section 2 briefly presents microcontroller designs similar to the one considered in this paper, and describes the instruction cycle structures of the three developed versions of the microcontroller and of its Microchip archetype. Section 3 describes the microcontroller architecture and analyzes the simulation results of a sample assembler code executed by the microcontroller. Section 4 discusses the implementation results of the microcontroller versions and compares existing implementations of similar microcontrollers. A dedicated hardware module which facilitates frequent exchange of the microcontroller program memory content is presented in Section 5, and two sample applications which illustrate the usefulness of the microcontroller are considered in Section 6. Section 7 concludes the paper.

2. Microchip PIC microcontrollers for FPGA implementation

Microcontrollers from the Microchip PIC16 family are encountered in many engineering applications. For example, a PIC16F877A was used as the main part of a GPRS-based positioning system [11] and of an embedded system which controls the scanning trajectories of an ultrasound transducer [12]. Another microcontroller (PIC16F688) was employed to control the operating conditions of a microstrip antenna [13].

There are also a number of PIC microcontroller designs targeted at FPGA implementation. IP cores of a simple PIC16C57 microcontroller are available from the OpenCores organization [14] and from an individual designer [15]. Both IP cores, called MINIRISC and RISC8 respectively, have an improved instruction cycle structure and are 4 times faster than the original PIC microcontroller. However, instructions which modify the program counter still require 2 clock cycles in the RISC8 and 4 clock cycles in the MINIRISC. For these instructions the MINIRISC is therefore two times slower than the RISC8, and only two times faster than the PIC16C57, which needs 8 clock cycles.

A more sophisticated PIC16C554 IP core, called DFPIC1655x, is commercially offered by Altera [16]. This IP core is 2 times faster than its Microchip PIC counterpart; however, it can be implemented only in Altera FPGAs. Another IP core design of a PIC16C6x microcontroller is considered in [17]. No speed improvement with respect to the original PIC microcontroller was reported in this case.

There are also a few IP core designs based on the Microchip PIC16F84 microcontroller architecture, which is very similar to the architecture of the design presented in this paper. A comparison of speed, power, flexibility and cost between this PIC microcontroller and its soft-core version is presented in [18]. Another IP core design of a PIC16F84 microcontroller, together with the power stability of its implementation, is considered in [19]. Yet another design, called CQPIC, was used as an important part of a pong game project [20]. All three designs implement the standard PIC16F84 instruction cycle structure; therefore, at the same clock frequency, they are not faster than their ASIC counterpart. Improved PIC16F84 IP cores are also available from OpenCores: PPX16 [21] and RISC16F84 [22]. These IP cores are 4 and 2 times faster, respectively, than the original ASIC version of the microcontroller.

Fig. 1. An instruction cycle of PIC16 family microcontrollers (a) and the three microcontroller versions (b), (c), (d).

Although there are soft-microcontroller designs which are 2 and 4 times faster than the original Microchip architecture, none of them is able to execute instructions which modify the program counter in a single instruction cycle. Single-cycle execution of instructions which modify the PC is a unique feature of the microcontroller presented in this paper.

The main design goal of the microcontroller was to maximize its performance. An important factor which directly influences microcontroller performance is the structure of the instruction cycle. The instruction cycle of the PIC16 family microcontrollers is shown in Fig. 1a. The cycle consists of four clock phases (clock cycles), named Q1...Q4. In the Q1 phase the program counter (PC) is incremented, and data memory is read during Q2. The ALU result is calculated in Q3 and written to data memory in the Q4 phase. During Q4 the instruction is also fetched from program memory and latched into the instruction register. Due to pipelining, each instruction is executed in one instruction cycle (4 clock cycles). The exceptions are instructions which change the PC (e.g. CALL, GOTO, RETLW); these require two instruction cycles (8 clock cycles).

To find the maximum achievable performance of the microcontroller presented in this paper, three different instruction cycle structures have been considered: a single clock cycle using both clock edges (Fig. 1b), a double clock cycle (Fig. 1c), and a single clock cycle (Fig. 1d). The microcontroller versions which implement these instruction cycle structures are called A, B and C, respectively. As will be shown in Section 4, the versions also differ in the maximum clock frequency which can be achieved after FPGA implementation.

Version A of the microcontroller requires both rising and falling clock edges; however, all instructions are executed within one clock period (Fig. 1b). On the first rising clock edge the instruction is fetched and preliminarily decoded. On the following falling clock edge data memory is read and the PC is updated (incremented, or loaded directly with the jump address). On the second rising clock edge data memory is written and the next instruction is fetched. Version B is almost the same as version A but, instead of both clock edges, it requires two clock cycles (Fig. 1c). Version C uses the fewest clock edges. In this version, PC updating, data memory reading and ALU result calculation are performed between contiguous rising clock edges. Data memory is written and the next instruction is fetched on the second rising clock edge (Fig. 1d).

Note that the instructions which modify the PC are also executed within a single instruction cycle (1 clock cycle in versions A and C, 2 clock cycles in version B). Therefore, for such instructions, versions A and C are 8 times faster than the ASIC counterpart (version B is 4 times faster). For all other instructions, versions A and C are 4 times faster (version B is 2 times faster).
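The speed-up figures above follow directly from the per-instruction clock counts. The following minimal Python sketch encodes only the counts stated in this section; the instruction mix in the usage lines is hypothetical, and taken skips are ignored.

```python
# Clock cycles per instruction class at the same clock frequency (Section 2).
CLOCKS = {
    "PIC16": {"regular": 4, "jump": 8},  # Q1..Q4; PC-modifying ops take 2 cycles
    "A": {"regular": 1, "jump": 1},      # one clock period, both clock edges
    "B": {"regular": 2, "jump": 2},      # two clock cycles, rising edges only
    "C": {"regular": 1, "jump": 1},      # one clock cycle, rising edges only
}


def speedup(version: str, kind: str) -> float:
    """Speed-up of a version over the PIC16 archetype for one instruction class."""
    return CLOCKS["PIC16"][kind] / CLOCKS[version][kind]


def program_clocks(core: str, n_regular: int, n_jump: int) -> int:
    """Clock cycles for an instruction mix (taken skips are not modelled)."""
    c = CLOCKS[core]
    return n_regular * c["regular"] + n_jump * c["jump"]


if __name__ == "__main__":
    for v in ("A", "B", "C"):
        print(v, speedup(v, "regular"), speedup(v, "jump"))
    # A 4.0 8.0 / B 2.0 4.0 / C 4.0 8.0
    # Hypothetical mix of 80 regular and 20 PC-modifying instructions:
    print(program_clocks("PIC16", 80, 20) / program_clocks("C", 80, 20))  # 4.8
```

For a real program the overall speed-up of versions A and C therefore lies between 4 and 8, depending on the share of PC-modifying instructions.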

3. Architecture of the microcontroller

The simplified architecture of the microcontroller is shown in Fig. 2. The main decoder and control unit (MDCU) constitutes an important part of the microcontroller. It is responsible for decoding instructions, controlling the behavior of the pipeline registers R1...R3, the working register (W) and the group of special function registers (SFRs), and generating the interrupt request signal. Two purely combinatorial circuits are situated close to the MDCU: the block evaluating the next PC value (NEXTPC) and the block calculating the next stack pointer value (NEXTSP). Similarly to the Microchip PIC16F877A, the microcontroller is equipped with two data memory banks of general purpose registers (GPRs), physically implemented as distributed RAM inside the FPGA. The address and control buses of the GPR banks are driven by the address decoder (AD) block, which is also a combinatorial circuit. The AD block also controls the MUX2 multiplexer, which delivers the first operand of the arithmetic-logic unit (ALU). This value can be an output of the GPR banks or SFRs, an input port value, or a literal value coming from the 8 least significant bits of the INSTRUCTION bus. The second ALU operand comes from the W register (accumulator).

Fig. 2. Simplified architecture of the microcontroller.

Listing 1. Sample of assembler code.

Unlike the Microchip PIC16 family, the microcontroller has input ports (PAin, PBin, PCin) separated from output ports (PAout, PBout, PCout); all ports are unidirectional. Therefore, care must be taken when an I/O port value is used as an operand (an instruction reading an I/O port always reads the content of the R1...R4 registers).

The microcontroller is also equipped with a simple 8-bit timer, TMR0, with a prescaler, similar to that of the Microchip PIC16F877A. However, only the internal clock input of TMR0 is available. No watchdog function has been implemented either. The microcontroller implements 33 of the 35 instructions of the PIC16F877A; the clear watchdog timer (CLRWDT) and SLEEP instructions have not been implemented. Synchronous RAM or ROM is required as the program memory of the microcontroller; the RAM blocks embedded inside FPGAs are ideal candidates for this purpose.
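The operand path described earlier in this section (the AD block driving the MUX2 select, with W always supplying the second ALU operand) can be sketched behaviorally as below. This is an illustration, not the paper's Verilog: the select encoding, the function names and the bank-0 address map (ports at 05h...07h and GPRs at 20h...7Fh, borrowed from the PIC16F877A) are all assumptions. Port reads are modelled as returning the input side (PAin...PCin), and any intermediate input register is ignored.

```python
from enum import Enum


class Src(Enum):
    """Hypothetical encoding of the MUX2 select input."""
    GPR = "gpr"          # general purpose register banks
    SFR = "sfr"          # special function registers
    PORT = "port"        # input ports PAin/PBin/PCin
    LITERAL = "literal"  # 8 LSBs of the INSTRUCTION bus


# Assumed bank-0 map (PIC16F877A-like): PORTA..PORTC at 05h..07h,
# general purpose registers at 20h..7Fh, other low addresses treated as SFRs.
PORT_ADDRS = {0x05: "A", 0x06: "B", 0x07: "C"}


def mux2_select(faddr: int, literal_op: bool) -> Src:
    """How the AD block might derive the MUX2 select from the file address."""
    if literal_op:
        return Src.LITERAL
    if faddr in PORT_ADDRS:
        return Src.PORT
    if 0x20 <= faddr <= 0x7F:
        return Src.GPR
    return Src.SFR


def first_operand(instr: int, faddr: int, literal_op: bool,
                  gpr: dict, sfr: dict, port_in: dict) -> int:
    """Value placed on sbus (first ALU operand); W is always the second one."""
    sel = mux2_select(faddr, literal_op)
    if sel is Src.LITERAL:
        return instr & 0xFF
    if sel is Src.PORT:
        # Ports are unidirectional: a read sees the input port, not PAout.
        return port_in[PORT_ADDRS[faddr]]
    if sel is Src.GPR:
        return gpr.get(faddr, 0)
    return sfr.get(faddr, 0)


# MOVLW 05h (PIC16 encoding 11 00xx kkkk kkkk) puts the literal on sbus.
print(hex(first_operand(0x3005, 0x00, True, {}, {}, {})))  # 0x5
```

Because of the port model, a read-modify-write on a port address (for example a bit-set instruction) would operate on the input value, which is the hazard the text warns about.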

As mentioned in the previous section, three versions of the microcontroller have been considered. From the architectural point of view, these versions differ in the number of pipeline registers R1...R3 and in the way these registers are controlled. There are also a few changes in the MDCU block between the versions. In version A only the R1 and R2 registers are implemented, and they are clocked on the falling clock edge. Version B uses all three registers R1...R3, which are clocked on the second rising clock edge (refer to Fig. 1c and the description in the previous section). In version C no pipeline registers are used (all of them are bypassed).

To better understand how the microcontroller works, let us consider the very simple assembler code shown in Listing 1 and the simulation results in Fig. 3. The first two columns of hexadecimal values in Listing 1 contain the addresses and instruction codes of the assembler mnemonics on the right-hand side. Fig. 3 shows the state of the address, instruction and output port A buses, and selected internal signals of the microcontroller (prefixed with the "_" character). Most of the signal names correspond to the labels in Fig. 2. The _faddr bus, not shown in Fig. 2, represents a file register address, which comes either from the 8 least significant bits of the instruction word or from the FSR register, depending on the current addressing mode. This bus is an internal part of the AD block and is used, e.g., as the source for the GPR bank address calculation (_gpr_addr bus). The simulation waveforms in Fig. 3 refer to version C of the microcontroller.

Fig. 3. Behavioral simulation results of the code from Listing 1.

The first three instructions of the listing load the value 05h into the W register, send the W register value to the port A output, and store the W register content in data memory (GPR) at location 55h. The subsequent changes of the W register content and of the port A output (PAout) can be observed in Fig. 3. Note that when the code of an instruction appears on the INSTRUCTION bus, the ADDRESS bus immediately takes the address of the next instruction (the NEXTPC block in Fig. 2 is responsible for this behavior).

After these three instructions, the next executed instruction is "addwf 55h,1". At the same time as the ADDRESS bus is updated, the AD block (Fig. 2) asserts the data memory address (gpr_addr bus in Fig. 3) and the control signal for the MUX2 multiplexer. Data memory is then read asynchronously, and the read value is conveyed to the source bus (sbus in Fig. 2) and to the first ALU input (sbus = 05h in Fig. 3). Since the MDCU makes the ALU perform an addition and the second ALU operand comes from the W register (05h in this case), the ALU result (0Ah) is calculated and placed on the destination bus (dbus). On the nearest rising clock edge, the value of dbus is written to the same data memory location from which the data were previously read (55h in this case).
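The single-cycle ADDWF sequence just described (asynchronous read, ALU addition with W, write-back on the rising edge) can be summarized in a short behavioral Python sketch. The instruction word 07D5h is derived from the standard PIC16 encoding of "addwf 55h,1" (00 0111 dfff ffff), since the code column of Listing 1 is not reproduced here. Bank selection and STATUS flags are deliberately left out, and the function name is hypothetical.

```python
def addwf_cycle(instr: int, w: int, gpr: dict) -> tuple:
    """One version-C clock cycle of ADDWF f,d (PIC16 encoding 00 0111 dfff ffff).

    Between rising edges: decode f, read data memory asynchronously onto sbus,
    add W in the ALU and drive dbus. On the closing rising edge the result is
    written to f (d = 1) or to W (d = 0). STATUS flags are not modelled.
    Returns (new_w, dbus).
    """
    if (instr >> 8) != 0b000111:
        raise ValueError("not an ADDWF instruction")
    f = instr & 0x7F              # 7-bit file address (bank bits ignored)
    d = (instr >> 7) & 1          # destination bit
    sbus = gpr.get(f, 0)          # asynchronous distributed-RAM read
    dbus = (sbus + w) & 0xFF      # 8-bit ALU sum
    if d:
        gpr[f] = dbus             # write-back on the rising clock edge
        return w, dbus
    return dbus, dbus


gpr = {0x55: 0x05}                              # after "movwf 55h" with W = 05h
new_w, dbus = addwf_cycle(0x07D5, 0x05, gpr)    # addwf 55h,1
print(hex(dbus), hex(gpr[0x55]))                # 0xa 0xa
```

The values reproduce the waveforms described in the text: sbus = 05h, W = 05h, and 0Ah written back to location 55h.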

The rising clock edge also causes the next instruction code (2018h) to appear on the INSTRUCTION bus. This instruction is a call of the subroutine located at address 018h. The address appears directly on the ADDRESS bus, and on the next rising clock edge the first instruction of the subroutine is fetched and subsequently executed. Therefore, as the simulation waveforms in Fig. 3 show, instructions which modify the program counter are executed within a single clock cycle.
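The single-cycle CALL works because NEXTPC is combinational: the target address is decoded from the word currently on the INSTRUCTION bus and driven onto ADDRESS within the same cycle. The following Python sketch shows the idea for a few PIC16 opcodes, using the standard PIC16 encodings (CALL 10 0kkk kkkk kkkk, GOTO 10 1kkk kkkk kkkk, RETLW 11 01xx kkkk kkkk, RETURN 0008h, RETFIE 0009h). The function name is hypothetical, the stack is an unbounded list rather than the 16-level hardware stack, RETLW's load of W is not modelled, and the CALL is assumed, purely for illustration, to sit at 00Fh.

```python
ADDR_MASK = 0x1FFF  # 13-bit program ADDRESS bus


def next_pc(instr: int, pc: int, pclath: int, stack: list) -> int:
    """Sketch of the combinational NEXTPC block for a few PIC16 opcodes.

    `pc` is the address of the word currently on the INSTRUCTION bus; the
    result is driven onto ADDRESS in the same clock cycle, so the target
    word is fetched on the next rising edge.
    """
    top3 = instr >> 11
    if top3 in (0b100, 0b101):                  # CALL k / GOTO k
        target = ((pclath & 0x18) << 8) | (instr & 0x7FF)
        if top3 == 0b100:                       # CALL pushes the return address
            stack.append((pc + 1) & ADDR_MASK)
        return target
    if instr in (0x0008, 0x0009) or (instr >> 10) == 0b1101:
        return stack.pop()                      # RETURN / RETFIE / RETLW k
    return (pc + 1) & ADDR_MASK                 # sequential execution


stack = []
print(hex(next_pc(0x2018, 0x00F, 0x00, stack)), [hex(a) for a in stack])
# 0x18 ['0x10']
```

The PCLATH<4:3> bits supply the upper two address bits for CALL and GOTO, as in the PIC16 family.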

The subroutine in Listing 1 also illustrates the indirect addressing capabilities of the microcontroller. The code writes the value read from data memory location 55h to the address contained in the W register (05h in this case, which is the port A address). As a result, the port A output changes from 05h to 0Ah.
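The indirect write can be illustrated with a tiny data-memory model that uses the usual PIC16 convention: accessing INDF (00h) uses FSR (04h) as the actual file address, and PORTA is at 05h. The three-instruction sequence below (movwf FSR; movf 55h,0; movwf INDF) is a hypothetical way of expressing the behavior described in the text, not a copy of Listing 1. The class name is also hypothetical, and bank bits are ignored.

```python
INDF, FSR, PORTA = 0x00, 0x04, 0x05   # PIC16 bank-0 file addresses


class RegisterFile:
    """Tiny data-memory model with PIC16-style indirect addressing."""

    def __init__(self):
        self.mem = {}
        self.pa_out = 0

    def faddr(self, f: int) -> int:
        # Accessing INDF means "use FSR as the address" (indirect mode).
        return self.mem.get(FSR, 0) if f == INDF else f

    def read(self, f: int) -> int:
        return self.mem.get(self.faddr(f), 0)

    def write(self, f: int, value: int) -> None:
        addr = self.faddr(f)
        self.mem[addr] = value & 0xFF
        if addr == PORTA:
            self.pa_out = value & 0xFF     # register drives PAout


rf = RegisterFile()
rf.mem[0x55] = 0x0A                   # result of "addwf 55h,1"
rf.pa_out = 0x05                      # port A output before the subroutine
w = 0x05                              # W holds the port A address
rf.write(FSR, w)                      # movwf FSR
w = rf.read(0x55)                     # movf 55h,0
rf.write(INDF, w)                     # movwf INDF
print(hex(rf.pa_out))                 # 0xa
```

The `faddr` method plays the role of the _faddr bus: it selects between the instruction's address field and FSR depending on the addressing mode.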


After the return from the subroutine, which also takes one clock cycle, the next instruction ("btfsc 55h,0") tests bit 0 (the LSB) of the value read from data memory location 55h. Since the bit is clear in this case, execution of the next instruction, located at 011h, is skipped: the skip signal is asserted for one clock cycle, during which the microcontroller performs no operation. As in the PIC16 family, this is the only case in which the execution of an instruction takes two instruction cycles. However, the microcontroller needs 4 times fewer clock cycles for it than a PIC16 microcontroller.
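The skip behavior and its cost can be sketched as follows. The sketch uses the standard PIC16 BTFSC encoding (01 10bb bfff ffff), from which "btfsc 55h,0" becomes 1855h. It also assumes 1 clock per instruction cycle for version C and 4 for the PIC16 archetype, as stated in Section 2. Function names are hypothetical.

```python
def btfsc_skips(instr: int, read) -> bool:
    """True if BTFSC f,b (PIC16 encoding 01 10bb bfff ffff) skips the next word."""
    if (instr >> 10) != 0b0110:
        raise ValueError("not a BTFSC instruction")
    f = instr & 0x7F
    b = (instr >> 7) & 0x7
    return ((read(f) >> b) & 1) == 0


def btfsc_clocks(skips: bool, clocks_per_cycle: int) -> int:
    """A taken skip adds one instruction cycle in which nothing is executed."""
    return clocks_per_cycle * (2 if skips else 1)


data = {0x55: 0x0A}                       # bit 0 of 0Ah is clear
skip = btfsc_skips(0x1855, data.get)      # btfsc 55h,0
print(skip, btfsc_clocks(skip, 1), btfsc_clocks(skip, 4))  # True 2 8
```

The taken skip costs 2 clock cycles in version C against 8 in the PIC16 archetype, which is the factor of 4 mentioned above.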

4. Implementation results

To determine the synthesis quality indicators of the microcontroller (FPGA logic resource requirements and maximum clock frequency), the three versions have been implemented in a selected Xilinx FPGA (Spartan-3A family) on the Digilent UG-330 prototype board. A performance test, which mainly computes isolated prime numbers within a selected range, has been conducted as well. The results are shown in Table 1.
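The paper does not list the benchmark source code or the range used. To make the workload concrete, here is a hedged Python sketch of the kind of computation involved. It assumes that "isolated" has its usual meaning (a prime p for which neither p - 2 nor p + 2 is prime) and uses plain trial division. The function names and the example range are illustrative only.

```python
def is_prime(n: int) -> bool:
    """Trial division, the kind of arithmetic an 8-bit core spends cycles on."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def isolated_primes(lo: int, hi: int) -> list:
    """Primes p in [lo, hi] for which neither p - 2 nor p + 2 is prime."""
    return [p for p in range(lo, hi + 1)
            if is_prime(p) and not is_prime(p - 2) and not is_prime(p + 2)]


print(isolated_primes(2, 60))  # [2, 23, 37, 47, 53]
```

On a PIC16-class core without a hardware divider, each `%` above becomes a software division routine, so this kind of benchmark is dominated by loops and subroutine calls.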

Note that the benchmark calculation times were measured with each microcontroller version clocked at a frequency very close to its maximum allowable value (36 MHz, 86.1 MHz and 55 MHz, respectively). If an equal clock frequency were applied, the benchmark calculation times of versions A and C would be the same, while the benchmark time of version B would be two times longer. This follows directly from the instruction cycle structures of the versions.

Version A can be clocked only at the lowest frequency and also has the lowest performance (the longest computation time: versions B and C complete the benchmark in about 15% and 33% less time, respectively). However, it requires the fewest logic resources (versions B and C require about 9% and 16% more slices, respectively). This version can be useful in applications in which the logic resource requirement is more important than the microcontroller performance.
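The relative figures quoted above, and the claim that version B needs twice as many clock cycles as version C, can be checked from Table 1 and the run frequencies given in the text. In the Python check below, time multiplied by frequency is treated as proportional to the number of executed clock cycles, so the time unit cancels out. Version A is left out of the cycle check because its run frequency is quoted only to two significant figures.

```python
# Table 1 figures and the clock frequencies used for the benchmark runs.
slices = {"A": 491, "B": 536, "C": 571}
bench_time = {"A": 655.492, "B": 558.225, "C": 436.995}   # unit cancels below
f_run_mhz = {"B": 86.1, "C": 55.0}


def pct_change(new: float, ref: float) -> float:
    """Relative change of `new` with respect to `ref`, in percent."""
    return 100.0 * (new - ref) / ref


# Time x frequency is proportional to the number of clock cycles executed.
cycle_ratio_b_c = ((bench_time["B"] * f_run_mhz["B"])
                   / (bench_time["C"] * f_run_mhz["C"]))

print(round(cycle_ratio_b_c, 3))                                 # 2.0
print(round(pct_change(bench_time["B"], bench_time["A"]), 1),
      round(pct_change(bench_time["C"], bench_time["A"]), 1))    # -14.8 -33.3
print(round(pct_change(slices["B"], slices["A"]), 1),
      round(pct_change(slices["C"], slices["A"]), 1))            # 9.2 16.3
```

The cycle ratio of almost exactly 2 agrees with version B's two-clock instruction cycle.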

Version B achieves the highest maximum clock frequency, but its performance is moderate. It can be used in applications in which a high clock frequency counts (the overall performance of a synchronous system is determined by its part clocked at the lowest frequency).

Version C offers the highest performance at a moderate maximum clock frequency. Its downside is that it requires the most FPGA logic resources.

Table 2 shows a short comparison of the existing IP cores of 8-bit PIC microcontroller implementations mentioned in Section 2. The comparison includes the following IP cores: MINIRISC [14], RISC8 [15], 16F84 [18], PPX16 [21], CQPIC [20], RISC16F84 [22] and version C of the microcontroller considered in this paper. With the exception of the third IP core listed in Table 2 (16F84), all cores are freely available and can be downloaded from their respective websites. For comparison purposes, these IP cores have been implemented in a Xilinx Spartan-3A FPGA using the default synthesis and implementation strategy of the Xilinx ISE Design Suite, and the post-implementation results have been placed in the last two columns of Table 2. For the third IP core, the values reported by its authors [18] have been taken into consideration.

Table 1. Post-route implementation results and benchmark program calculation times.
Version  Number of slices  Max clock frequency (MHz)  Benchmark calculation time (µs)
A        491               36.7                       655.492
B        536               86.8                       558.225
C        571               55.8                       436.995

It is important to note that the IP cores differ in the capacity of data and stack memory, and in the presence of Timer0 and the watchdog timer (WDT). These are significant factors influencing the FPGA logic resource requirements of the IP core implementations. Moreover, the first two IP cores (MINIRISC and RISC8) are based on the Microchip PIC16C57 architecture, which is simpler than that of the PIC16F84 (e.g. the PIC16C57 has no interrupt system and its instruction word is shorter). Therefore, the resource requirements of these IP cores are naturally smaller than those of the others.

As can be seen in Table 2, version C of the microcontroller has the highest absolute FPGA resource requirement and the lowest maximum clock frequency, but it is the only core able to execute jump instructions (instructions which modify the program counter) in a single cycle. However, if the number of slices required by each IP core implementation is related to the total number of slices available in a particular FPGA chip, for example the XC3S700A on the Digilent UG-330 board (a rather low-density FPGA), then the difference between the smallest PIC16F84-based implementation (RISC16F84) and the largest one (version C) is no more than 2.2% of this chip's slices. This is not a significant difference.
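The chip-relative comparison takes one line of arithmetic. The Python sketch below uses the slice counts from Table 2 and 5,888 slices for the XC3S700A. That device figure is taken from the Xilinx Spartan-3A data sheet (DS529), not from this paper.

```python
XC3S700A_SLICES = 5888   # device capacity from the Xilinx Spartan-3A data sheet

table2_slices = {"RISC16F84": 461, "version C": 571}

for name, n in table2_slices.items():
    print(f"{name}: {100 * n / XC3S700A_SLICES:.2f}% of the chip")
# RISC16F84: 7.83% of the chip
# version C: 9.70% of the chip

diff = table2_slices["version C"] - table2_slices["RISC16F84"]
print(f"difference: {100 * diff / XC3S700A_SLICES:.2f}%")   # difference: 1.87%
```

The result (about 1.9% of the device) is consistent with the bound of 2.2% stated in the text.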

Version B of the microcontroller is not listed in Table 2; apart from the last three columns, its parameters, like those of version A, are the same as for version C. It offers a relatively high maximum clock frequency (see Table 1) compared with the other IP core implementations based on the PIC16F84 architecture. Version A, in turn, has a relatively low logic resource requirement, but its maximum clock frequency is significantly lower than the lowest frequency of the other implementations considered. Nevertheless, like version C, it is able to execute jump instructions in a single clock cycle, which no other implementation can do.

In addition to the parameters included in Table 2, a performance evaluation of the IP core implementations has also been carried out. Unfortunately, the evaluation, conducted in the same way as for the microcontroller introduced in this paper, fully succeeded in only one case: the RISC16F84 IP core. The CQPIC and PPX16 implementations gave incorrect prime number results, and their benchmark calculation times were totally implausible (5.542 ms and 11.378 ms, respectively). This may suggest that these IP cores have not been fully tested by their designers and contain errors, probably in the ALU, which are revealed under these specific conditions. For the MINIRISC and the RISC8, the HI-TECH C compiler used failed to compile the benchmark source code for the PIC16C57 architecture. The reason was probably that this Microchip architecture can handle only 2 stack levels, whereas the benchmark procedure requires at least 3.

As Table 3 shows, the measured benchmark calculation time of the RISC16F84 IP core implementation, clocked at 100 MHz (very close to its maximum clock frequency), was 553.839 ms. This is 116.844 ms more than the benchmark time of version C of the microcontroller, clocked at 55 MHz, and 4.386 ms less than the benchmark time of version B, clocked at 86.1 MHz (a 13.9% lower frequency than for the RISC16F84). This shows the very good performance of version C, even though it was clocked almost 2 times slower. With the RISC16F84 implementation clocked at 86.1 MHz, the same frequency as version B, the benchmark time was 643.168 ms, which is 15.2% more than for version B. Both the RISC16F84 and version B require 2 clock cycles for regular instructions, but for jump instructions the RISC16F84 requires twice as many cycles as version B. This explains the performance advantage of version B when an equal clock frequency is applied.

Table 2. Comparison of different PIC microcontroller IP core implementations (+ implemented, − not implemented).
Microcontroller         ASIC         Data memory  Stack   Timer0/WDT    Clock cycles     Spartan-3A  Max clock
                        counterpart  (bytes)      levels  implemented   regular/jump     slices      freq. (MHz)
MINIRISC                PIC16C57     72           4       +/+           1/4              347         81.9
RISC8                   PIC16C57     72           2       +/−           1/2              292         64.1
16F84                   PIC16F84     128          8       −/−           4/8              568         69.7
PPX16                   PIC16F84     68           8       +/−           1/2              511         62.0
CQPIC                   PIC16F84     68           8       +/+           4/8              498         99.0
RISC16F84               PIC16F84     192          16      −/−           2/4              461         102.5
Version C (this paper)  PIC16F87x    192          16      +/−           1/1              571         55.8

Table 3. Performance comparison between the IP core implementations of the RISC16F84 and the microcontroller.
IP core implementation          Clock frequency (MHz)  Benchmark calculation time (ms)
RISC16F84                       100.0                  553.839
RISC16F84                       86.1                   643.168
The microcontroller, version C  55.0                   436.995
The microcontroller, version B  86.1                   558.225

5. Hardware downloader

Preparing and testing software for FPGA embedded microcontrollers often requires repeated downloading of data to the microcontroller program memory, which is usually implemented as dedicated RAM blocks inside the FPGA. Such memory can basically be initialized during the project implementation process. In this approach, however, any change of the microcontroller memory content requires reimplementation of the whole project, which is time consuming and inconvenient when the program code changes frequently. It is also possible to directly modify the part of the FPGA configuration bitstream that is responsible for memory initialization, without reimplementing the project [23]. However, this is not a straightforward process. For example, for Xilinx FPGAs two specific files must be prepared (one describing the RAM block connections and their physical locations inside the FPGA, and a second containing the new memory content), and a dedicated Xilinx application (data2mem) must be used to update the FPGA bitstream file.

Fig. 4. Connection of the downloader module to the program memory and the microcontroller. [Figure: DOWNLOADER block (CLK, RST, RX, TX, RST_CPU, WR_EN, ADDR, DATA); DUAL PORT BLOCK RAM (port A: ADDRA, DOUTA, CLKA; port B: ADDRB, DINB, WEB, CLKB); MICROCONTROLLER (CLK, RST, ADDRESS (13), INSTRUCTION (14), 8-bit input/output ports PAin/PAout, PBin/PBout, PCin/PCout, INTERRUPT input INT); external signals CLK1, CLK2, RST, SERIAL_RX, SERIAL_TX.]

To facilitate frequent changes of the microcontroller program memory content, this paper proposes adding to the FPGA project an IP core of a dedicated module, the downloader, which is able to load new memory content during system operation using a serial interface. Although adding a new component to the project increases the logic resource requirements, the downloader module is needed only during the software development process. Once this process is complete, the downloader IP core can be removed from the project. Therefore, the FPGA resource requirements of the final implementation of the project do not increase.

Fig. 4 shows how the downloader module is connected to the program memory and the microcontroller. The memory (block RAM) is dual-ported: the first port is connected to the microcontroller, while the second is driven by the downloader. The downloader also drives the reset input (RST) of the microcontroller, which is held in the reset state during data transfer. The schematic in Fig. 4 contains two clock signals: CLK1, which clocks the downloader (its frequency is closely related to the data rate of the downloader serial transceiver), and CLK2, which clocks the microcontroller. In particular cases the two clock signals can be connected together.

The downloader uses a generic serial interface for data transfer. Its important parts are therefore a serial receiver and transmitter, and a direct digital frequency synthesizer, which generates suitable clock frequencies for the transmitter and the receiver. An essential part of the downloader is an Intel HEX data format interpreter. The interpreter makes it possible to process the compiler output file directly, since Intel HEX is a common data format for such files. The downloader module requires few FPGA logic resources (285 Spartan 3A slices, whereas the microcontroller needs from 491 to 571 slices).

On the user side, changing the program memory content only requires an application such as HyperTerminal to transfer the compiler HEX output file directly to the downloader. This is much easier than the most commonly used approaches described above.


Table 4. FPGA resource requirements for different approaches to the application realization.

| Approach | Number of slices | Number of block RAM blocks |
|---|---|---|
| Hardware | 317 | 0 |
| Hardware with UART transmitter | 382 | 0 |
| Software | 530 | 2 |


6. Sample applications

To illustrate the usefulness of the embedded microcontroller, two sample applications have been developed. The main task of the first application is to read the temperature from a 1-wire sensor (DS18B20 from Dallas/Maxim) and present it on an LCD display. The second application involves a specific peripheral, a Manchester code decoder, integrated with the microcontroller, and it can be used as a simple access control system with RFID tags. Both applications have been implemented and tested on the Digilent UG-330 evaluation board.

The first application has been realized in two versions: using a typical FPGA design flow (a set of specific IP cores described in Verilog HDL), referred to as the hardware approach, and using the embedded microcontroller, referred to as the software approach. In the first approach, IP cores for the 1-wire bus, the LCD display service and the interpretation of the sensor data had to be developed and tested. This is harder, less flexible and more time consuming than realizing the same tasks by writing software in a high-level programming language such as C. However, as the implementation results show, for the considered application the hardware approach requires fewer FPGA resources (317 slices vs. 530 slices, see Table 4), which is an advantage of this approach. Moreover, adding the extra function of sending the temperature value via a serial port increases the resource requirements to 382 slices in the hardware approach. In the software approach, the same functionality does not require any additional FPGA resources (only the program code size increases).

The second application is an example of dividing tasks between software and hardware inside an FPGA. The Manchester decoder, realized as a hardware block, analyzes and decodes the data from an external RFID demodulator (reader) and makes them available to the embedded microcontroller software. The software compares the identification bits (IDs) read from an RFID tag with the authorized IDs stored in memory and displays information on the LCD display. In this case the software part of the application focuses on the essential functionality and the user interface, which are much easier and faster to realize in software. The application requires 830 slices of the Spartan 3A FPGA.

7. Conclusions

It has been shown that even a simple 8-bit microcontroller embedded in an FPGA may be helpful in implementing some application functions. General non-timing-critical tasks, such as the user interface service, can be realized in software, which is easier, more flexible and faster to develop than an analogous hardware design. However, depending on the FPGA logic resource requirements of the embedded microcontroller, a software realization of some tasks may need more resources than its hardware counterpart.

Developing and testing the software for the sample applications described in Section 6 has proved that the downloader module is useful for supporting frequent changes of the microcontroller program code. Without the downloader, any change of the microcontroller program memory content is very inconvenient and time consuming.

Unlike the specific microcontrollers offered by FPGA vendors, such as Xilinx PicoBlaze or Lattice Mico8, the presented microcontroller is based on the popular Microchip PIC16 architecture and can be implemented in FPGAs from any vendor. Moreover, at the same clock frequency the microcontroller is significantly faster than its Microchip archetype (4 times faster for regular instructions and 8 times faster for instructions that modify the program counter).

Considering the different instruction cycle structures of the microcontroller, it has been shown that the best performance is obtained when the instruction cycle takes one clock cycle with a single clock edge. Adding pipeline registers increases the maximum clock frequency (when a single clock edge is used), but it also increases the number of clock cycles needed to execute an instruction. Therefore, the overall performance does not increase.
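The trade-off can be made explicit with the usual relation between the number of executed instructions N, the clock cycles per instruction (CPI) and the clock frequency f. As a worked check, assume that versions B and C execute the same instruction sequence, with version C needing 1 clock cycle per instruction (Table 2) and version B 2 clock cycles per instruction including jumps (as implied by the comparison with the RISC16F84), at the clock frequencies of Table 3:

$$
t = \frac{N \cdot \mathrm{CPI}}{f}, \qquad
\frac{t_B}{t_C} = \frac{\mathrm{CPI}_B / f_B}{\mathrm{CPI}_C / f_C}
= \frac{2 / 86.1\ \mathrm{MHz}}{1 / 55.0\ \mathrm{MHz}} \approx 1.2776,
\qquad
\left.\frac{t_B}{t_C}\right|_{\text{Table 3}} = \frac{558.225\ \mathrm{ms}}{436.995\ \mathrm{ms}} \approx 1.2774
$$

The close agreement illustrates why the higher clock frequency of version B does not compensate for its doubled cycle count.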

Compared with other similar implementations, the weaknesses of the microcontroller are relatively high logic resource requirements and a low maximum clock frequency (especially for the microcontroller versions A and C). However, a unique feature of the microcontroller is its ability to execute jump instructions in a single instruction cycle, which is not found in similar implementations. This feature, together with an instruction cycle structure that takes only one clock cycle, ensures very good performance of the microcontroller.

Future work will concentrate on increasing the maximum clock frequency of the microcontroller (the critical path needs to be identified and redesigned). An attempt at a fully asynchronous FPGA implementation of the microcontroller is also planned. Some promising results have already been achieved in this demanding field of asynchronous circuit implementation in commercial FPGAs.

Appendix A. Supplementary material

The Verilog source code (IP cores) of the three microcontroller versions, the downloader and the sample applications, as well as the C source code of the benchmark test and of the software part of the sample applications, is available as supplementary data. Supplementary data associated with this article can be found in the online version at http://dx.doi.org/10.1016/j.micpro.2013.10.004.

References

[1] T. Schnettler, Microcontroller Design in FPGAs, EE Times, 2008. <http://www.eetimes.com/design/programmable-logic/4015188/Microcontroller-Design-in-FPGAs>.

[2] B.H. Fletcher, FPGA Embedded Processor – Revealing True System Performance, Embedded Systems Conference San Francisco, 2005, pp. 1–18.

[3] Xilinx, PicoBlaze 8-bit Embedded Microcontroller User Guide, UG129, June 2011.

[4] U. Meyer-Baese, Digital Signal Processing with Field Programmable Gate Arrays, Springer, Berlin Heidelberg, 2007.

[5] R.E. Haskell, D.M. Hanna, A VHDL-forth core for FPGAs, Microprocessors and Microsystems 28 (3) (2004) 115–125.

[6] A. Ehliar, P. Karlstrom, D. Liu, A high performance microprocessor with DSP extensions optimized for the Virtex-4 FPGA, in: Int. Conf. on Field Programmable Logic and Applications, 2008, pp. 599–602.

[7] M. Tamagnone, M. Martina, G. Masera, An application specific instruction set processor based implementation for signal detection in multiple antenna systems, Microprocessors and Microsystems 36 (3) (2012) 245–256.

[8] Lattice Semiconductor Corporation, LatticeMico8 Open, Free 8-bit Soft Microcontroller. <http://www.latticesemi.com/products/intellectualproperty/ipcores/mico8.cfm>.

[9] Altera Corp., Nios II Processor: The World’s Most Versatile Embedded Processor. <http://www.altera.com/devices/processor/nios2/ni2-index.html>.

[10] Microchip, PIC16F87XA 28/40/44-Pin Enhanced Flash Microcontrollers. <http://ww1.microchip.com/downloads/en/DeviceDoc/39582C.pdf>.

[11] W.M. El-Medany, A. Alomary, R. Al-Hakim, S. Al-Irhayim, M. Nousif, Implementation of GPRS-Based Positioning System Using PIC Microcontroller, in: Int. Conf. on Computational Intelligence, Communication Systems and Networks (CICSyN), 2010, pp. 365–368.

[12] P.N. Rivera-Arzola, J.C. Ramos-Fernandez, J.M.O. Franco, M. Villanueva-Ibanez, M.A. Flores-Gonzalez, A PIC microcontroller embedded system for medical rehabilitation using ultrasonic stimulation through controlling planar X-Y scanning Trajectories, in: IEEE Conf. on Electronics, Robotics and Automotive Mechanics (CERMA), 2011, pp. 307–310.

[13] S. Genovesi, A. Monorchio, M.B. Borgese, S. Pisu, F.B. Valeri, Frequency-reconfigurable microstrip antenna with biasing network driven by a PIC microcontroller, IEEE Antennas and Wireless Propagation Letters 11 (2012) 156–159.

[14] Opencores Organization, Mini-Risc core Overview. <http://opencores.org/project,minirisc,overview>.

[15] T. Coonan, RISC8 core. <http://www.mindspring.com/~tcoonan/risc8.pdf>.

[16] Altera Corp., DFPIC1655X - RISC Microcontroller. <http://www.altera.com/products/ip/processors/8_4bit/m-dcd_dfic1655x.html>.

[17] H. Madasi, S.N. Bhavanam, V. Midasala, A PIC compatible RISC CPU core implementation for FPGA based configurable SOC platform for embedded applications, ACEEE International Journal on Electrical and Power Engineering 2 (2) (2011) 11–15.

[18] D.F. Gómez Prado, Embedded Microcontrollers and FPGAs Soft-cores, Electronica 18 (2006) 3–14. <http://sisbib.unmsm.edu.pe/bibvirtualdata/publicaciones/electronica/n18_2006/a02.pdf>.

[19] S.Y. Yuan, F. Chia, P.S. Chang, S.S. Liao, The power stability of FPGA-based microcontroller design and measurement, in: Asia–Pacific Symposium on Electromagnetic Compatibility (APEMC), 2010, pp. 1096–1099.

[20] S. Moutou, Using CQPIC Soft Processor, 2010. <http://moutou.pagesperso-orange.fr/ER2/Core16F84_en.pdf>.

[21] Opencores Organization, PPX16 mcu Overview. <http://opencores.org/project,ppx16>.

[22] Opencores Organization, Risc16f84 Overview. <http://opencores.org/project,risc16f84>.

[23] Xilinx, Data2MEM User Guide, June 2009.

Zbigniew Hajduk received the M.Sc. (Eng.) degree from the Rzeszów University of Technology, Rzeszów, Poland, in 1998 and the Ph.D. degree from the University of Zielona Góra, Zielona Góra, Poland, in 2006. He is currently an Assistant Professor in the Department of Electrical and Computer Engineering, Rzeszów University of Technology, Poland. His main area of interest is digital systems design with FPGAs. He is the author of two books (in Polish) on microcontrollers in remote control systems and on FPGA design using the Verilog Hardware Description Language.