LEON3 System-on-Chip Port for BEE2 and ASIC Implementation
Timothy Wong
1. Introduction
The LEON3 System-on-Chip platform is part of an open-source IP library from Gaisler
Research, GRLIB [4]. The system and its derivatives have been used both in professional and
academic research applications. The Berkeley Emulation Engine 2 (BEE2) [2] is a multi-chip
FPGA board seeing extensive use in research as an enabler for multi-core emulation and
hardware acceleration. In this report we detail the modification of the LEON3 system for use
with the BEE2. The LEON3's implementation of the established SPARC V8 instruction set
architecture (ISA), together with its open-source distribution and attractive software support,
makes it an increasingly popular platform for research applications. Combining it with the
high-resource, high-performance BEE2 allows a wider range of LEON3 system configurations and
offers another option for application-friendly multi-core research. We also modify the system for
ASIC implementation and show its viability as a real-world driver for current 2D CAD flows and
a possible test bed for future 3D CAD flows.
2. Gaisler Research GRLIB
The GRLIB IP library is distributed under the GNU GPL open-source license from
Gaisler Research. The library includes extensive support for many FPGA boards. It includes the
LEON3 processor core along with a debug unit, memory interfaces, and a number of other
peripheral modules, all of which implement interfaces for the ARM AMBA 2.0 bus [1].
The library is attractive because it is a full system-on-chip implementation built around
the LEON3 processor, which uses a well-supported instruction set (SPARC V8) and is capable of
symmetric multiprocessing (SMP). It also provides easier on-chip debugging and a full
compilation toolchain for programs, including a memory allocation library and a full Linux 2.6
operating system that can run on the system when implemented on an FPGA.
2.1 LEON3 Processor Core Basic Features
• 32-bit SPARC V8 ISA
• ARM AMBA Advanced High Performance Bus (AHB) interface
• 7-stage pipeline
• Multiprocessing support (processor ID, snooping)
• Separate instruction and data caches: write-through; configurable associativity, cache line size, and replacement policy
• Optional features: memory management unit, hardware FPU, multiply-accumulate (MAC), debug support
Figure 1: LEON3 Core Block Diagram
3. LEON3 on Xilinx XUP
Initial work with the processor was done with the Xilinx XUP board [11]. This board
uses a Virtex2 Pro 30 FPGA with a DDR memory interface and JTAG, RS232, and 10/100
Ethernet PHY connections as peripheral I/O options. As this board is supported in the GRLIB
library as a reference design, the main objective of using this board was to serve as a test
platform. We wanted to get the system operational, verify that it could run the provided Linux
implementation, and run multithreaded programs. We also wanted to determine the approximate
size of the core and see how many cores could fit on the FPGA.
As stated previously, the XUP board is supported by the GRLIB library as a reference
design. Therefore, getting basic functionality was a matter of familiarizing oneself with the
VHDL code base and hardware configuration options as well as configuring and compiling the
recommended Snapgear Linux operating system.
3.1 LEON3 System-on-Chip XUP Configuration
2 LEON3 processor cores with SMP
65MHz system clock with XST
32-byte cache line (block)
4KB 2-way I-cache, LRU replacement policy
4KB direct-mapped D-cache
Memory management unit: split instruction and data TLBs, 8 entries each, LRU replacement policy
Instruction trace buffer
Hardware multiply-accumulate unit
100MHz DDR clock
Round-robin AMBA AHB arbitration policy
Figure 2: LEON3 SoC Xilinx Virtex2Pro30 Floorplan
Figure 3: LEON3 System-on-Chip XUP Block Diagram
3.2 LEON3 System-on-Chip Xilinx XUP Synthesis and Resource Utilization
The main application used for synthesis of the LEON3 system was the Xilinx XST
synthesis tool in the Xilinx ISE 9.1i development environment. Attempts to use other synthesis
tools, such as Synplicity's Synplify, were met with incompatibilities with Xilinx's mapping
and place-and-route tools used to implement the design for the FPGA.
Resource Utilization      Used        Available   Utilization
LEON3 SoC XUP
  Number of Slices        13,309      13,696      97.2%
  Number of 4-LUTs        25,431      27,392      92.8%
  Equivalent Gate Count   2,591,690
LEON3 Single Core
  Number of Slices        9,225       13,696      67.4%
  Number of 4-LUTs        10,615      27,392      38.8%
  Equivalent Gate Count   1,015,990
Table 1: LEON3 XUP Resource Utilization
Each core in the design uses roughly 39% of the available hardware logic resources (4-LUTs).
Since the system requires that the number of cores in the system be a power of 2, it was
impossible to synthesize more than 2 cores for this FPGA without running out of resources.
Having 2 cores however, enabled testing of the software toolchain, Snapgear Linux in SMP
mode, and multithreaded programs.
3.3 Debug Tool chain
Gaisler Research provides two simulation environments, TSIM and GRSIM, that can
simulate LEON processors of different configurations. Either can be used alone or used in a
larger framework, for example by connecting to the GNU GDB debugger. However, there was little
interest in the internal behavior of the processor core, and no complications arose in bringing
the system up that necessitated experimenting with these simulators.
The main debug tool for GRLIB is the GRMON Debug Monitor [7] available for Linux
or with a graphical interface in Linux and Windows with Java. The software connects to the system
through JTAG, serial, or Ethernet interfaces. Through this I/O connection, the software interacts
with the system through the Debug Support Unit (DSU) on the chip. The DSU must be
implemented for GRMON to function.
Through this framework, the user can list the system configuration to confirm the
presence of modules in the system and view their configuration. This is done by reading status
registers in the modules that are implemented at hardware synthesis. These configuration
registers have their own bus lines and can be read easily by the bus arbitration unit and bus
masters (DSU with LEON). Because these are just implemented as registers, one could also use
them for other purposes like performance counters. Since the AMBA bus design assigns a
memory space to each AMBA slave, registers can alternatively be mapped to memory addresses.
So, reading and writing status registers can be no different from reading and writing actual
memory. Clearly, this bus design also allows the user to read and write any part of memory.
GRMON also has control over execution of the processor. One can load programs into
memory, run them, step through instructions, and set breakpoints and watchpoints much like
GDB. Symbols can also be loaded into the program for easier debugging. The process of booting
Snapgear Linux on the system is to load the compiled operating system image into RAM, set the
program counter to the beginning of the image, and initiate execution on the LEON.
3.4 Software Environment
The version of Linux used on this system was the latest available as of 6/07,
Snapgear-p33a, based on the Linux 2.6 kernel [9]. A graphical user interface is provided to
configure operating system services and choose basic system applications to install before
compiling. For the purposes of this project, the only options of importance were to enable SMP
support and networking services, and to choose a C library (either glibc or uClibc).
Custom applications should be compiled with the operating system and included in the
resulting image file. Source code must be added to the “user/custom” directory along with the
appropriate compilation directives in the directory’s Makefile. In order to test SMP functionality,
we compiled two multithreaded applications from the SPLASH-2 benchmark suite [10], radix
and fft, to be loaded with the operating system.
Once compiled into an image, the operating system is booted with GRMON using the
method described previously. By default, the operating system connects its console to the
FPGA board's serial port. The serial port is connected to another computer, where minicom was
used to interact with the system.
Despite the mostly smooth operation after learning the toolchain, we found that the
POSIX thread (Pthread) library was not functional. Although single-threaded applications ran
without error, multithreaded applications would halt immediately at the Pthread function
invocations. Several versions of Snapgear were tested with different configurations using glibc
and uClibc. Changing the multithreading macros in the application also did not fix the problem.
We notified the developers of this bug and decided to proceed to the next step anticipating a fix
with a later release of the operating system.
4. LEON3 on Berkeley Emulation Engine 2 (BEE2)
As noted previously, the XUP board only provided enough resources for two LEON3
cores to be synthesized. We decided then to move to a larger FPGA. The BEE2 provides 5
Virtex2Pro 70 FPGAs, each with more logic and memory resources than the Virtex2Pro 30
found on the XUP board. In addition, the board provides the same JTAG, serial, and Ethernet I/O
ports needed for the LEON system. Unlike the XUP board, however, each FPGA has connections
to 4 DDR2 memory modules, has different I/O pin locations, and uses an external system clock.
Overcoming these last items required significant modification to the system memory interface,
as well as detailed knowledge of the BEE2 reference modules and board design.
Since the BEE2 is mainly developed for and used by an academic audience, reference
documentation was relatively difficult to locate and not particularly user-friendly. Fortunately,
our previous work implementing the OpenRISC 1200 core on the BEE2 helped immensely with board
infrastructure bring-up, such as dealing with the external system clock and identifying the
appropriate board pin locations for the design.
Figure 4: Control FPGA I/O Diagram [3]
The different memory interface however, required designing new modules to interface
between the AHB and DDR2 memory. The first steps were to get familiar with the BEE2 DDR2
Controller reference design and the GRLIB DDR Controller design to see where the new design
would fit in.
4.1 BEE2 Reference DDR2 Controller
The BEE2 reference DDR2 Controller was originally designed for use with the IBM
PowerPC processors on the Virtex2Pro FPGA. The controller is partitioned into several parts: an
interface for IBM’s CoreConnect Processor Local Bus (PLB), an arbiter for forwarding requests
from multiple modules, an asynchronous FIFO interface for crossing the core/memory clock
boundary, the DDR2 controller itself, and infrastructure to generate and propagate the DDR2
clock.
4.2 GRLIB DDR Controller
The GRLIB DDR Controller is partitioned somewhat differently from the BEE2 design.
It consists of a DDR controller and a DDR PHY. The DDR controller module implements a
slave interface to process AHB requests and an interface to the DDR PHY to output the
appropriate DDR commands. Thus, the controller has two clock inputs and works across the
clock boundary between the AHB clock and DDR clock. Since GRLIB is meant to support
multiple platforms, the DDR PHY is used to provide the appropriate platform dependent DDR
clock generation and I/O buffering.
Figure 5: BEE2 Reference DDR2 Controller vs. GRLIB DDR Controller Block Diagram
4.3 AMBA DDR2 Interface
After analyzing the two designs, it was clear that we should reuse the DDR2 controller and
infrastructure from the BEE2 design rather than develop a new DDR2 controller from
scratch. The asynchronous FIFO design was also a convenient way to cross the AHB and DDR2
clock domains. As a result, the new module had to take AHB slave requests and convert them to
the appropriate memory commands for the DDR2 controller. It also had to take the DDR2
controller response and generate the appropriate AHB slave response. The GRLIB DDR
controller would serve as a basis for comparison with the new module in terms of interpreting
and generating AHB signals.
Figure 6: LEON3 System-on-Chip BEE2 Block Diagram
4.3.1 AMBA 2.0 AHB Specification
To understand the design of the AMBA DDR2 Interface module, it is necessary to
understand the AHB specification [1]. For this module, we are mainly concerned with the slave
request and response mechanism. What complicates the design is the pipelined nature of the bus
specification. Rather than a 2-cycle request/response mechanism, transactions are pipelined.
Requests are made while responses from previous requests are being returned. We provide a
sample AMBA timing diagram (Figure 7).
In the diagram, the “Control” signal corresponds to a read or write command with burst
length and data width. HREADY and HRDATA are slave responses to the command. The timing
as explained in the AMBA 2.0 technical documentation is rather confusing, so we try to explain
it more clearly from a design standpoint.
Figure 7: AMBA Read/Write Request Timing Diagram [1]
From the slave’s point of view, in the first interval, suppose a “read A” command is
interpreted from the Control and HADDR buses. This causes the slave to feed the input of the
HRDATA register with the data from A and the input of the HREADY register with HIGH
during the same interval. During the second interval, the output of HRDATA register eventually
becomes valid with the data from A, while the output for HREADY becomes HIGH. These
outputs are read by the master during this interval.
From the master’s point of view, in the first interval, HREADY was interpreted as valid
HIGH, acknowledging that the last command finished. Thus, in the same interval, the input of
HADDR is changed to B with the inputs of the Control changing as well for the new command
on B. If the command to address A was a write instead of a read, the input to HWDATA is
driven with the write data for address A. The outputs of these registers (command to B, and write
data for A) become valid during the second interval for the slave to read.
This example only shows the read or write of a single address. For the most part, the
GRLIB DDR controller operates with sequential bursts of reads and writes, usually
corresponding to the size of a cache line. A sequential burst simply increments the
address bus each cycle following the initial read or write command. Burst reading is complicated
in actual operation, however, by the delay when reading from memory, which necessitates the
insertion of wait states by the slave, by setting HREADY to LOW.
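The pipelined behavior above can be made concrete with a small cycle-level model (a sketch only: the slave here is a hypothetical single-word memory with a configurable read latency, not the GRLIB controller, and all names are ours):

```python
# Cycle-by-cycle model of AMBA AHB pipelining: the master drives the
# address phase of transfer N while the slave completes the data phase
# of transfer N-1. The slave inserts wait states (HREADY low) to model
# memory read latency. Hypothetical slave; for illustration only.

def run_ahb(transfers, read_latency=1):
    mem = {}
    log = []                 # (transfer in data phase, HREADY, HRDATA)
    addr_phase = None        # (op, addr, wdata) currently in address phase
    data_phase = None        # transfer currently in data phase
    wait = 0
    queue = list(transfers)
    while queue or addr_phase or data_phase:
        hready = True
        hrdata = None
        if data_phase:
            op, addr, wdata = data_phase
            if op == "read" and wait < read_latency:
                hready = False          # wait state: stall the bus
                wait += 1
            else:
                if op == "write":
                    mem[addr] = wdata
                else:
                    hrdata = mem.get(addr, 0)
                wait = 0
        log.append((data_phase, hready, hrdata))
        if hready:                      # bus advances only on HREADY high
            data_phase = addr_phase
            addr_phase = queue.pop(0) if queue else None
    return mem, log

mem, log = run_ahb([("write", 0x40000000, 0xAB), ("read", 0x40000000, None)])
```

Running this shows the write's data phase overlapping the read's address phase, and the read stalling one cycle before returning the written value.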
4.3.2 BEE2 Reference DDR2 Controller Command Interface
The BEE2 DDR2 Controller interface is much simpler than the AHB interface. The
master (the AMBA DDR2 Interface module) issues a command by setting a command valid bit,
the command type (read or write), a command identifier (tag), write data, and write byte enable
mask. The DDR2 controller acknowledges the command by setting a command
acknowledgement bit. This completes a write command. For a read command, the DDR2
controller will also set a read valid bit along with the command tag and data when the read data
is ready. The master will complete the read command by setting a read acknowledgement bit.
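This valid/ack protocol can be summarized with a small model (illustrative only: the real interface is a set of synchronous hardware signals, and the class and method names below are invented for the sketch):

```python
# Model of the BEE2 DDR2 controller command handshake described above:
# the master raises a valid bit with the command; the controller raises
# an ack; reads additionally return a (tag, data) pair with a read-valid
# bit that the master then acknowledges. Sketch only.

class DDR2ControllerModel:
    def __init__(self):
        self.mem = {}
        self.rd_valid = False
        self.rd_tag = None
        self.rd_data = None

    def command(self, rnw, addr, tag=0, wdata=None):
        # Returning True models mem_cmd_ack.
        if not rnw:                  # write: the ack completes the command
            self.mem[addr] = wdata
            return True
        # read: ack now; data presented later with rd_valid and the tag
        self.rd_valid = True
        self.rd_tag = tag
        self.rd_data = self.mem.get(addr, 0)
        return True

    def read_ack(self):
        # Master sets mem_rd_ack to complete the read.
        self.rd_valid = False

ctrl = DDR2ControllerModel()
assert ctrl.command(rnw=False, addr=0x100, wdata=0xDEAD)  # write acked
assert ctrl.command(rnw=True, addr=0x100, tag=7)          # read acked
assert ctrl.rd_valid and ctrl.rd_tag == 7 and ctrl.rd_data == 0xDEAD
ctrl.read_ack()
```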
4.3.3 AMBA DDR2 Interface Design
Given this background information we can now describe the AMBA DDR2 Interface
design. The design is a finite state machine with an idle state, three write states, and two read
states. The state descriptions are quite straightforward; however, as with most hardware designs,
timing is the major source of complexity.
All requests are handled in the IDLE state. In this state, incoming requests are interpreted,
initial steps for dealing with the request are taken, and a state transition is made to the first read
or first write state. As a note, the general action in each state is to change the inputs of
registers so that the outputs of the registers reflect these changes in the following cycle.
4.3.3.1 Read Requests
When a read request is made in the IDLE state, an AMBA wait state is inserted by setting
HREADY to LOW. The address is stored in a register (craddr) and a read command is issued to
the DDR2 controller for the entire cache line (32 bytes); thus, the last 5 bits of the address are
ignored. The next state is set to READ1.
In READ1, the module waits for the read command acknowledgement from the DDR2
controller. When this is received, the read command is reset, with the command valid bit set to
LOW so that the command is not issued twice. The module continues in this state until a read
valid is received from the DDR2 controller. When this is detected, HREADY is set to HIGH and
craddr is incremented. As can be seen from the diagram, craddr selects from the DDR2 read data
register to output on the AHB. Thus, the first data is returned in the next cycle. The next state is
set to READ2.
In READ2, the rest of the requested data is put on the AHB while the read burst
command is maintained. Finally, when the command has finished or the burst limit is reached
(bursts may not access the next cache line), a read acknowledge signal is sent to the DDR2
controller. HREADY is also set LOW so that the IDLE state can properly deal with the next
command. This last action could be a source of inefficiency as it should be possible to begin
processing the next command in the current state. However, design simplicity was favored over
adding complexity for a bit of performance gain.
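The address arithmetic used by these read states can be sketched in a few lines (illustrative Python; the helper names are ours, and the 8-words-per-line figure follows from the 32-byte cache line and the 32-bit AHB data width):

```python
# Address arithmetic for the read path described above (illustrative):
# reads always fetch a full 32-byte cache line, so the low 5 address
# bits are zeroed for the DDR2 command, and craddr[4:2] selects which
# 32-bit word of the returned line goes out on HRDATA each cycle.

LINE_BYTES = 32

def line_base(haddr):
    return haddr & ~(LINE_BYTES - 1)        # {haddr[31:5], 5'b0}

def word_select(craddr):
    return (craddr >> 2) & 0x7              # craddr[4:2], 8 words per line

def burst_words(haddr):
    # Word indices produced as craddr increments by 4, stopping at the
    # line end (the burst may not cross into the next cache line).
    craddr = haddr
    out = []
    while True:
        out.append(word_select(craddr))
        craddr += 4
        if (craddr & (LINE_BYTES - 1)) == 0:  # wrapped to next line
            break
    return out

assert line_base(0x40000014) == 0x40000000
assert word_select(0x40000014) == 5
assert burst_words(0x40000000) == [0, 1, 2, 3, 4, 5, 6, 7]
assert burst_words(0x40000018) == [6, 7]    # burst limited by line end
```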
Figure: AMBA DDR2 Interface read-request FSM (IDLE → READ1 → READ2). The diagram shows the
register-input assignments (n prefix = register input, c prefix = register output) for the memory
command signals (mem_cmd_valid, mem_cmd_addr, mem_cmd_rnw, mem_cmd_tag, mem_wr_be, mem_wr_din,
mem_rd_ack), the transition conditions (is_slave_target and hwrite = 0; mem_cmd_ack = 1;
mem_rd_valid = 1; ahb_stop or craddr[4:2] = 3'b000), and the craddr[4:2]-indexed MUX that selects
32-bit words of mem_rd_dout onto hrdata[31:0] from the DDR2 async FIFO. Shorthands:
ahb_stop <= hsel = '0' or htrans != SEQ; is_slave_target <= hready = 1 and hsel = 1 and
(htrans = SEQ or htrans = NONSEQ).
4.3.3.2 Write Requests
Unlike read requests made in the IDLE state, no AMBA wait state is inserted for writes.
Instead, the write data is written into a write buffer selected by the write address. A bit mask is
also updated to indicate which bytes of the 32-byte cache line are to be written to. The next state
is set to WRITE1.
In WRITE1, the rest of the write buffers are filled along with corresponding bit mask
updates until the write burst finishes. At this point, HREADY is set LOW and the next state is set
to WRITE2.
In the first cycle of WRITE2, the last write buffer has just been written with the last write
data. The command to write the data in the write buffers is then issued to the DDR2 controller
with 8 filler ECC bits interleaved every 8 bytes. The next state is set to WRITE3.
In WRITE3, the module waits for the write command acknowledgement from the DDR2
controller before resetting the memory command and byte enable registers and transitioning back
to IDLE.
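The byte-enable accumulation and the ECC-filler interleaving in WRITE2 can be sketched as follows (a sketch only: the function names are ours, and the exact bit ordering of the filler fields is an assumption based on the {0, cbe[31:24], …} pattern in the write FSM figure):

```python
# Sketch of how 256 bits of buffered write data are expanded to the
# 288-bit mem_wr_din by inserting 8 filler ECC bits before every 8 data
# bytes, with the 32-bit byte-enable mask expanded to 36 bits the same
# way. Bit layout is illustrative; the real ordering follows the BEE2
# controller netlist.

def expand_wdata(wdata_bytes):          # 32 data bytes -> 36 bytes
    assert len(wdata_bytes) == 32
    out = bytearray()
    for i in range(0, 32, 8):
        out += b"\x00"                  # 8 filler ECC bits
        out += wdata_bytes[i:i + 8]
    return bytes(out)

def expand_be(be32):                    # 32-bit mask -> 36-bit mask
    out = 0
    for group in range(4):
        chunk = (be32 >> (8 * group)) & 0xFF
        out |= chunk << (9 * group)     # one spare bit above each group
    return out

def maskdecode(haddr, hsize):
    # Byte-enable bits within the 32-byte line for one AHB beat;
    # hsize is log2 of the transfer size, as in the AHB spec.
    nbytes = 1 << hsize
    offset = haddr & 0x1F
    return ((1 << nbytes) - 1) << offset

assert len(expand_wdata(bytes(range(32)))) == 36
assert expand_be(0xFF) == 0xFF
assert expand_be(0xFF00) == 0xFF << 9
assert maskdecode(0x40000004, 2) == 0xF0   # one 4-byte beat at offset 4
```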
Figure: AMBA DDR2 Interface write-request FSM (IDLE → WRITE1 → WRITE2 → WRITE3). The diagram
shows the eight write buffers (write_buff1..write_buff8, enabled by cwritebuff_en via
decode(haddr)), the byte-enable accumulation (nbe := cbe | maskdecode(haddr, hsize)), the WRITE2
command issue (nmem_cmd_addr := {cwaddr[31:5], 5'b0}; nmem_wr_be := {0, cbe[31:24], …, 0,
cbe[7:0]}; nmem_wr_din := {8'h0, wdata[255:192], …, 8'h0, wdata[63:0]}), the transition
conditions (is_slave_target and hwrite = 1; ahb_stop or cwaddr[4:2] = 3'b111; mem_cmd_ack = 1),
and the full lists of AHB slave outputs and memory command outputs. Shorthands:
ahb_stop <= hsel = '0' or htrans != SEQ; is_slave_target <= hready = 1 and hsel = 1 and
(htrans = SEQ or htrans = NONSEQ).
4.4 Simulation and Testing
Although the background information and design are presented cleanly here, a significant
amount of simulation and testing had to be done to fully understand the AHB specification. For
this purpose, a simulation testbench was written to drive the memory interfaces with different
AHB commands and command sequences. The testbench was used both to study the GRLIB DDR
controller in order to drive the design of the AMBA DDR2 Interface, and to compare the two
modules for functional testing.
This simulation was done with Mentor Graphics ModelSim in conjunction with actual
testing of the design on the FPGA. A fully synthesized design of the LEON3 system with the
AMBA DDR2 Interface was implemented on the BEE2 board. A range of testing, from simple
reading and writing of memory through GRMON to full GDB-style debugging while booting
Snapgear Linux, was used to identify bugs in the design and devise better simulation test
cases.
Testing was completed when the LEON3 system could fully boot Snapgear-p34b and run
multithreaded versions of radix and fft. This version of Snapgear Linux fixed the Pthread library
issue we had discovered earlier.
Figure 8: LEON3 SPLASH-2 Performance. “init” runs take into account time for initialization code.
4.5 LEON3 System-on-Chip BEE2 Configuration
4 LEON3 processor cores with SMP
65MHz system clock (75MHz w/o hierarchy)
32-byte cache line (block)
4KB 2-way I-cache, LRU replacement policy
4KB direct-mapped D-cache
Memory management unit: split instruction and data TLBs, 8 entries each, LRU replacement policy
Instruction trace buffer
Hardware multiply-accumulate unit
200MHz DDR2 clock
Round-robin AMBA AHB arbitration policy
Figure 9: LEON3 SoC Xilinx Virtex2Pro70 Floorplan
4.6 LEON3 System-on-Chip BEE2 Synthesis and Resource Utilization
Resource Utilization      Used        Available   Utilization
LEON3 SoC BEE2 65
  Number of Slices        27,775      33,088      83.9%
  Number of 4-LUTs        49,071      66,176      74.2%
  Equivalent Gate Count   6,204,015
LEON3 SoC BEE2 75
  Number of Slices        27,221      33,088      82.3%
  Number of 4-LUTs        47,721      66,176      72.1%
  Equivalent Gate Count   6,199,374
LEON3 Single Core
  Number of Slices        9,225       33,088      27.9%
  Number of 4-LUTs        10,615      66,176      16.0%
  Equivalent Gate Count   1,015,990
Table 2: LEON3 BEE2 Resource Utilization
5. LEON3 for ASIC Implementation
Although the modifications made to the LEON3 system allowed implementation on the
BEE2 platform, the design was not ready for ASIC design synthesis. The system relies on
primitives specific to FPGA chips. These include I/O buffers, digital clock generators, on-chip
SRAM, and FIFOs. For ASIC synthesis, these primitives need to be replaced by corresponding
physical design library primitives. Compatibility will depend on the particular modules included
with the physical library.
For this project, we experimented with the libraries available to us: an NCSU SOI library for
the MIT Lincoln Labs 180nm technology and a UMC 90nm library. Although we had these
standard cell libraries, we did not have access to their respective memory compiler libraries.
Instead, we could only use an older memory compiler, Cadence CRETE, based on 180nm
technology, which does not provide timing information.
Even with this limitation, we still had to overcome the absence of an asynchronous
FIFO module in any of the libraries we were able to obtain, since these primitives are very
specific to the FPGA chip.
These two issues were the main obstacles for ASIC implementation; most other
primitives, like I/O buffers and clock buffers, can be inferred by the physical design tools. The
main tool we used was Magma Talus. We chose this program because it integrates the synthesis
and implementation steps of the design flow, letting us avoid the tool compatibility issues that
can arise when separate programs are used for different steps in the flow.
5.1 Physical Memory Conversion
As stated earlier, we did not have access to an implementation-specific physical memory
compiler that would give us complete memory specifications for the physical design flow.
However, we did have one that allowed us to run through the design flow and obtain a rough
area estimate for the chip.
To make use of the memory modules created with the memory compiler, additional
support was required in the source code. Fortunately, because the system is organized as a library,
memories are abstracted from the design and the type of memory used is configurable with a
VHDL generic variable. So, to use the physical memories created from the memory compiler, we
just had to write an interface layer and configure the variables properly. Since the interface for
physical memories in other libraries would likely be very similar to the interface from our
Cadence compiler, very minimal changes to source code would be required to support these
other memories.
With this modification made, we were able to run through the full physical design flow
for a single LEON3 core without other peripherals. We summarize our results for area of a single
LEON3 core as configured in the BEE2 in the following table:
LEON3 ASIC Area
Library          Technology   # Std Cells   Area (mm2)
NCSU             180nm        28,302        2.37
UMC              90nm         22,972        0.15
Cadence CRETE    180nm        -             4.15
Table 3: LEON3 ASIC Area Cost. NCSU and UMC show their respective area costs for logic; Cadence CRETE shows the cost of memory.
There are orders of magnitude of difference in area going from NCSU to UMC.
First, we notice a substantial ~18% decrease in standard logic cell usage for the
design. This can be explained by the larger selection of library primitives supported by the UMC
library, which allows more efficient logic mapping. If we assume the same number of logic
cells for NCSU, a rough calculation gives an expected area of 1.92mm2. This is still
12.8 times larger than the UMC area. With a factor-of-2 reduction in feature size from
NCSU to UMC we could optimally expect a 4 times reduction in area, which still does not
come close to the 12.8 times figure. We can explain this, however, by the reduction in cell
size that UMC implements over NCSU, probably through vastly more efficient cell design. The
cell height in UMC is 2.52µm while it is 12µm in NCSU; this difference alone would already
indicate a 4.7 times decrease in area. To get the 12.8 times reduction we would have to see an
average 2.7 times reduction in cell width, which does not seem unreasonable given the
reduction we see in cell height.
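The arithmetic behind this comparison can be checked directly (a back-of-the-envelope sketch using the Table 3 figures and taking the quoted cell heights of 2.52 and 12, presumably in microns):

```python
# Back-of-the-envelope check of the NCSU-vs-UMC area comparison above.
ncsu_cells, ncsu_area = 28302, 2.37     # 180nm library
umc_cells, umc_area = 22972, 0.15       # 90nm library

# Normalize the NCSU area to the UMC cell count.
ncsu_norm = ncsu_area * umc_cells / ncsu_cells
assert abs(ncsu_norm - 1.92) < 0.01             # ~1.92 mm2
assert abs(ncsu_norm / umc_area - 12.8) < 0.1   # ~12.8x gap remains

# Feature-size scaling alone (180nm -> 90nm) predicts only ~4x.
assert (180 / 90) ** 2 == 4.0

# Cell-height ratio accounts for ~4.8x; the rest must come from width.
height_ratio = 12 / 2.52                # cell heights, assumed in um
assert abs(ncsu_norm / umc_area / height_ratio - 2.7) < 0.1
```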
5.2 Asynchronous FIFO Conversion
The inclusion of the DDR2 controller for the BEE2 design necessitated the use of
asynchronous FIFOs in the FPGA implementation. Unfortunately, this was not included as part
of any of the standard cell libraries available to us, and it is rather unlikely that any of the
physical design libraries would include this type of module as part of their memory compilers.
As a result, quite a bit of modification had to be done to adapt the design for physical
synthesis. The main module affected again was the AMBA DDR2 Interface. This would not be
as simple as designing a FIFO using simple registers and memory libraries. The main issue here
is crossing the AMBA bus/DDR2 clock boundary. The asynchronous FIFOs in the FPGA design
dealt with this seamlessly; however, there is no equally elegant solution using only physical
design primitives. Dealing with this required additional logic and modification of the
original AMBA DDR2 Interface FSM.
5.2.1 AMBA DDR Physical Interface
This report will not go into as much detail as previously in the description of the original
interface. In all, three additional FSMs operating in the DDR2 clock domain are needed. One
handles sending the command (read, write, or idle) and address to the DDR2 controller, one
handles demultiplexing 32-byte writes into separate 16-byte writes across clock cycles, and the
third handles multiplexing 16-byte responses into a 32-byte response.
Two additional states, one for reading and one for writing, are added to the original
AMBA FSM to interact with the aforementioned DDR FSMs. The complexity here was in
dealing with the asynchronous handshaking between all FSMs so that there was no deadlock or
other incorrect behavior.
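The width conversion performed by the second and third FSMs amounts to the following (a minimal sketch; the beat ordering is an assumption):

```python
# Sketch of the width conversion handled by the DDR-domain FSMs above:
# a 32-byte AMBA-side write is demultiplexed into two 16-byte beats for
# the DDR2 controller, and two 16-byte read responses are multiplexed
# back into one 32-byte response. Low-half-first ordering is assumed.

def demux_write(line32):
    assert len(line32) == 32
    return [line32[:16], line32[16:]]   # two 16-byte write beats

def mux_read(beats):
    assert len(beats) == 2 and all(len(b) == 16 for b in beats)
    return beats[0] + beats[1]          # reassembled 32-byte response

line = bytes(range(32))
beats = demux_write(line)
assert mux_read(beats) == line          # round-trips losslessly
```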
The handshaking involves three state variables: ddrstart, ddrstart_ack, and ddrstartold.
The first two variables are in the AMBA clock domain while the last is in the DDR clock domain.
All variables are initialized to 0.
The AMBA FSM initiates a DDR command by toggling (inverting) ddrstart and then waits. This
initiates the DDR command FSM, which starts when ddrstart XOR ddrstartold = 1, copying
the memory command into registers in the DDR domain and sending the command to the DDR
controller. If it is a write, the DDR controller will signal a request for the separate 16-byte data;
this is handled by the write FSM and concludes the memory command in the DDR clock
domain. If it is a read, the controller will signal that data is ready to be read; the separate 16-byte
reads are handled by the read FSM, which concludes the read in the DDR clock domain. The read
and write FSMs both set ddrstartold to ddrstart when they complete. On the AMBA side, the
memory command resumes when ddrstartold XOR ddrstart_ack = 1. The FSM then sets
ddrstart_ack to ddrstartold and completes the AMBA bus transaction as in the original design.
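The toggle-based handshake can be modeled at the event level (a sketch; the method names are ours, and real hardware would additionally synchronize each bit through flip-flop chains before sampling it in the other domain):

```python
# Event-level model of the toggle handshake described above. ddrstart
# and ddrstart_ack live in the AMBA clock domain, ddrstartold in the
# DDR clock domain; single-bit toggles cross the boundary safely since
# only one bit changes per transaction. Sketch only.

class Handshake:
    def __init__(self):
        self.ddrstart = 0
        self.ddrstart_ack = 0
        self.ddrstartold = 0

    def amba_start(self):               # AMBA side: toggle to kick off
        self.ddrstart ^= 1

    def ddr_pending(self):              # DDR side: command pending?
        return self.ddrstart ^ self.ddrstartold == 1

    def ddr_complete(self):             # DDR side: FSM done, copy over
        self.ddrstartold = self.ddrstart

    def amba_done(self):                # AMBA side: may resume?
        return self.ddrstartold ^ self.ddrstart_ack == 1

    def amba_resume(self):              # AMBA side: acknowledge
        self.ddrstart_ack = self.ddrstartold

hs = Handshake()
for _ in range(3):                      # the protocol works repeatedly
    assert not hs.ddr_pending() and not hs.amba_done()
    hs.amba_start()
    assert hs.ddr_pending()
    hs.ddr_complete()
    assert not hs.ddr_pending() and hs.amba_done()
    hs.amba_resume()
    assert not hs.amba_done()
```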
This handshake mechanism follows a similar design to that found in the original GRLIB
DDR controller, but is adapted for use with separate FSMs.
5.3 Simulation and Testing
Since there was no documentation for the BEE2 DDR2 Controller interface without
asynchronous FIFOs, we relied on simulation to learn how to interact with it. Because we had a
working design using the asynchronous FIFOs, we could inspect the behavior of that design to
help design the new AMBA Interface. Unfortunately, this also became a source of error and
confusion.
Throughout design and testing, an effort was made to reproduce the FIFO waveforms as
they appear in the simulation environment. Although our design met this criterion, the first cycle
of write input to the DDR controller seemed to be missed in each case. When the design was
modified to get the correct functional behavior in simulation, this led to incorrect behavior when
the design was tested on the FPGA board. In the end, it was discovered that the simulation
environment used did not respect implicit delta delays in VHDL code when registers change
value. This meant that although the waveforms exactly matched those of the previous working
FIFO design (which used Xilinx netlists), internally the simulator was processing different
values for the signals. This necessitated inserting “after” clauses into signal assignments so that
the FPGA and simulation behavior would match.
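As a hypothetical illustration of this workaround (the actual signal names in the port differ), an explicit delay can be attached to a signal assignment:

```vhdl
-- Without the explicit delay, the simulator updated the register and
-- its readers in the same delta cycle, diverging from the behavior of
-- the netlist-based FIFO design.  A small inertial delay restores the
-- intended ordering in simulation and is ignored by synthesis.
-- (Hypothetical signal names.)
ddr_command <= next_command after 1 ns;
```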
Although at times we were able to run the new design on the FPGA, many times we
were unable to initialize the system after pushing the design onto the board. We were unable to
discover the source of this problem; however, it does not appear to be an issue with the
modifications made to the design. Rather, we believe it may stem from the FPGA synthesis tools
not placing the AMBA Interface close enough to the DDR2 Controller. The BEE2 DDR2
Controller is known to be very sensitive to placement (which is why the controller itself is
distributed in netlist form for implementation) because of the tight timings needed for the
DDR2 clocks on the relatively slower FPGA.
However, on the few occasions we were able to initialize the system on the FPGA, we were
able to verify when our simulation was out of sync, and eventually that the final design could
perform user-specified reads and writes through the DSU. Unfortunately, in these instances
the FPGA synthesis tools were unable to meet the timing requirements specified for the LEON3’s
system clock. This meant that when we tried to load the Snapgear Linux image, which executed
back-to-back memory operations at system speed, some operations would be lost. Thus, we were
unable to verify Linux functionality.
6. Conclusion
In this project we ported the open-source GRLIB LEON3 system-on-chip for
use on the BEE2 board, allowing the use of more cores than on the Xilinx XUP board and
enabling a faster DDR2 interface for the LEON. This port is also being used in a research project
at Princeton University that uses the LEON for CMP emulation.
After further modification, the system is now ready for physical design synthesis. We
have shown preliminary results for 2D synthesis of a single LEON3 core and see it as a viable
candidate for use as a real-world application driver in future exploration of 3D physical design
synthesis.
All sources for this project are made available to the public at the UCLA VLSI CAD Lab
website as a Software Release (http://cadlab.cs.ucla.edu/software_release/bee2leon3port).
7. References
1. AMBA Specification Rev. 2.0, http://www.arm.com/products/solutions/AMBA_Spec.html (http://www.gaisler.com/doc/amba.pdf)
2. BEE2 – Berkeley Emulation Engine 2, http://bee2.eecs.berkeley.edu/
3. BEE2wiki, http://bee2.eecs.berkeley.edu/wiki/BEE2wiki.html
4. Gaisler Research GRLIB IP Library, http://www.gaisler.com/cms/index.php?option=com_content&task=section&id=13&Itemid=125
5. GRLIB IP Cores Manual, http://www.gaisler.com/products/grlib/grlib.pdf
6. GRLIB User’s Manual, http://www.gaisler.com/products/grlib/grlib.pdf
7. GRMON 1.1.27b Manual, http://www.gaisler.com/doc/grmon.pdf
8. RAMP – Research Accelerator for Multiple Processors, http://ramp.eecs.berkeley.edu/
9. Snapgear for LEON manual, ftp://gaisler.com/gaisler.com/linux/linux-2.6/snapgear
10. Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. In Proceedings of the 22nd International Symposium on Computer Architecture, pages 24-36, Santa Margherita Ligure, Italy, June 1995 (http://www-flash.stanford.edu/apps/SPLASH/)
11. Xilinx XUP User Manual, http://www.xilinx.com/univ/XUPV2P/Documentation/XUPV2P_User_Guide.pdf