LEON3 System-on-Chip Port for BEE2 and ASIC Implementation
Timothy Wong
1. Introduction
The LEON3 System-on-Chip platform is part of an open-source IP library from Gaisler
Research, GRLIB [4]. The system and its derivatives have been used both in professional and
academic research applications. The Berkeley Emulation Engine 2 (BEE2) [2] is a multi-chip
FPGA board seeing extensive use in research as an enabler for multi-core emulation and
hardware acceleration. In this report we detail the modification of the LEON3 system for use
with the BEE2. The LEON3's implementation of the established SPARC V8 instruction set
architecture (ISA), together with its open-source distribution and attractive software support,
makes it an increasingly popular platform for research applications. Combining it with the
high-resource, high-performance BEE2 allows a wider range of LEON3 system configurations and
offers another option for application-friendly multi-core research. We also modify the system for
ASIC implementation and show its viability as a real-world driver for current 2D CAD flows and
a possible test bed for future 3D CAD flows.
2. Gaisler Research GRLIB
The GRLIB IP library is distributed under the GNU GPL open-source license from
Gaisler Research. The library includes extensive support for many FPGA boards. It includes the
LEON3 processor core along with a debug unit, memory interfaces, and a number of other
peripheral modules, all of which implement interfaces for the ARM AMBA 2.0 bus [1].
The library is attractive because it is a full system-on-chip implementation built around
the LEON3 processor, which uses a well-supported instruction set (SPARC V8) and is capable of
symmetric multiprocessing (SMP). It also provides easier on-chip debugging and a full
compilation toolchain for programs, including a memory allocation library and a full Linux 2.6
operating system that can run on the system when implemented on an FPGA.
2.1 LEON3 Processor Core Basic Features
• 32-bit SPARC V8 ISA
• ARM AMBA Advanced High Performance Bus (AHB) interface
• 7-stage pipeline
• Multiprocessing support (processor ID, snooping)
• Separate instruction and data caches: write-through; configurable associativity, cache line size, and replacement policy
• Optional features: memory management unit, hardware FPU, multiply-accumulate (MAC), debug support
Figure 1: LEON3 Core Block Diagram
3. LEON3 on Xilinx XUP
Initial work with the processor was done with the Xilinx XUP board [11]. This board
uses a Virtex2 Pro 30 FPGA with a DDR memory interface and JTAG, RS232, and 10/100
Ethernet PHY connections as peripheral I/O options. As this board is supported in the GRLIB
library as a reference design, the main objective of using this board was to serve as a test
platform. We wanted to get the system operational, verify that it could run the provided Linux
implementation, and run multithreaded programs. We also wanted to determine the approximate
size of the core and see how many cores could fit on the FPGA.
As stated previously, the XUP board is supported by the GRLIB library as a reference
design. Therefore, getting basic functionality was a matter of familiarizing oneself with the
VHDL code base and hardware configuration options as well as configuring and compiling the
recommended Snapgear Linux operating system.
3.1 LEON3 System-on-Chip XUP Configuration
2 LEON3 processor cores with SMP
65MHz system clock with XST
32-byte cache line (block)
4KB 2-way I-cache, LRU replacement policy
4KB direct-mapped D-cache
Memory management unit: split instruction and data TLBs, 8 entries each, LRU replacement policy
Instruction trace buffer
Hardware multiply-accumulate unit
100MHz DDR clock
Round-robin AMBA AHB arbitration policy
Figure 2: LEON3 SoC Xilinx Virtex2Pro30 Floorplan
Figure 3: LEON3 System-on-Chip XUP Block Diagram
3.2 LEON3 System-on-Chip Xilinx XUP Synthesis and Resource Utilization
The main application used for synthesis of the LEON3 system was the Xilinx XST
synthesis tool in the Xilinx ISE 9.1i development environment. Attempts to use other synthesis
tools, such as Synplicity's Synplify, were met with incompatibilities with Xilinx's mapping
and place-and-route tools used to implement the design for the FPGA.
Resource Utilization      Used        Available   Utilization
LEON3 SoC XUP
  Number of Slices        13,309      13,696      97.2%
  Number of 4-LUTs        25,431      27,392      92.8%
  Equivalent Gate Count   2,591,690
LEON3 Single Core
  Number of Slices        9,225       13,696      67.4%
  Number of 4-LUTs        10,615      27,392      38.8%
  Equivalent Gate Count   1,015,990
Table 1: LEON3 XUP Resource Utilization
Each core in the design uses roughly 39% of the available hardware logic resources (4-LUTs).
Since the system requires that the number of cores in the system be a power of 2, it was
impossible to synthesize more than 2 cores for this FPGA without running out of resources.
Having 2 cores however, enabled testing of the software toolchain, Snapgear Linux in SMP
mode, and multithreaded programs.
3.3 Debug Tool chain
Gaisler Research provides two simulation environments, TSIM and GRSIM, that can
simulate LEON processors of different configurations. Either can be used alone or used in a
larger framework, for example by connecting to the GNU GDB debugger. However, there was little
interest in the internal behavior of the processor core, and no complications arose in bringing
the system up that necessitated experimenting with these simulators.
The main debug tool for GRLIB is the GRMON Debug Monitor [7] available for Linux
or with a graphical interface in Linux and Windows with Java. The software connects to the system
through JTAG, serial, or Ethernet interfaces. Through this I/O connection, the software interacts
with the system through the Debug Support Unit (DSU) on the chip. The DSU must be
implemented for GRMON to function.
Through this framework, the user can list the system configuration to confirm the
presence of modules in the system and view their configuration. This is done by reading status
registers in the modules that are implemented at hardware synthesis. These configuration
registers have their own bus lines and can be read easily by the bus arbitration unit and bus
masters (DSU with LEON). Because these are just implemented as registers, one could also use
them for other purposes like performance counters. Since the AMBA bus design assigns a
memory space to each AMBA slave, registers can alternatively be mapped to memory addresses.
So, reading and writing status registers can be no different from reading and writing actual
memory. Clearly, this bus design also allows the user to read and write any part of memory.
GRMON also has control over execution of the processor. One can load programs into
memory, run them, step through instructions, and set breakpoints and watchpoints much like
GDB. Symbols can also be loaded into the program for easier debugging. The process of booting
Snapgear Linux on the system is to load the compiled operating system image into RAM, set the
program counter to the beginning of the image, and initiate execution on the LEON.
3.4 Software Environment
The version of Linux used on this system was the latest available as of 6/07,
Snapgear-p33a, based on the Linux 2.6 kernel [9]. A graphical user interface is provided to
configure operating system services and choose basic system applications to install before
compiling. For the purposes of this project, the only options of importance were to enable SMP
support and networking services, and to choose a C library (either glibc or uClibc).
Custom applications should be compiled with the operating system and included in the
resulting image file. Source code must be added to the “user/custom” directory along with the
appropriate compilation directives in the directory’s Makefile. In order to test SMP functionality,
we compiled two multithreaded applications from the SPLASH-2 benchmark suite [10], radix
and fft, to be loaded with the operating system.
Once compiled into an image, the operating system is booted with GRMON using the
method described previously. By default, the operating system connects its console to the
FPGA board's serial port. The serial port is connected to another computer, where minicom was
used to interact with the system.
Despite the mostly smooth operation after learning the toolchain, we found that the
POSIX thread (Pthread) library was not functional. Although single-threaded applications ran
without error, multithreaded applications would halt immediately at the Pthread function
invocations. Several versions of Snapgear were tested with different configurations using glibc
and uClibc. Changing the multithreading macros in the application also did not fix the problem.
We notified the developers of this bug and decided to proceed to the next step anticipating a fix
with a later release of the operating system.
4. LEON3 on Berkeley Emulation Engine 2 (BEE2)
As noted previously, the XUP board only provided enough resources for two LEON3
cores to be synthesized. We decided then to move to a larger FPGA. The BEE2 provides 5
Virtex2Pro 70 FPGAs, each with more logic and memory resources than the Virtex2Pro 30
found on the XUP board. In addition, the board provides the same JTAG, serial, and Ethernet I/O
ports needed for the LEON system. Unlike the XUP board, however, each FPGA has connections
to 4 DDR2 memory modules, has different I/O pin locations, and uses an external system clock.
Overcoming these last items required significant modification to the system memory interface,
as well as detailed knowledge of the BEE2 reference modules and board design.
Since the BEE2 is mainly developed for and used by an academic audience, reference
documentation was relatively difficult to locate and not particularly user-friendly. Fortunately,
our previous work implementing the OpenRISC 1200 core on the BEE2 helped immensely with board
infrastructure bring-up, such as dealing with the external system clock and identifying the
appropriate board pin locations for the design.
Figure 4: Control FPGA I/O Diagram [3]
The different memory interface however, required designing new modules to interface
between the AHB and DDR2 memory. The first steps were to get familiar with the BEE2 DDR2
Controller reference design and the GRLIB DDR Controller design to see where the new design
would fit in.
4.1 BEE2 Reference DDR2 Controller
The BEE2 reference DDR2 Controller was originally designed for use with the IBM
PowerPC processors on the Virtex2Pro FPGA. The controller is partitioned into several parts: an
interface for IBM’s CoreConnect Processor Local Bus (PLB), an arbiter for forwarding requests
from multiple modules, an asynchronous FIFO interface for crossing the core/memory clock
boundary, the DDR2 controller itself, and infrastructure to generate and propagate the DDR2
clock.
4.2 GRLIB DDR Controller
The GRLIB DDR Controller is partitioned somewhat differently from the BEE2 design.
It consists of a DDR controller and a DDR PHY. The DDR controller module implements a
slave interface to process AHB requests and an interface to the DDR PHY to output the
appropriate DDR commands. Thus, the controller has two clock inputs and works across the
clock boundary between the AHB clock and DDR clock. Since GRLIB is meant to support
multiple platforms, the DDR PHY is used to provide the appropriate platform dependent DDR
clock generation and I/O buffering.
Figure 5: BEE2 Reference DDR2 Controller vs. GRLIB DDR Controller Block Diagram
4.3 AMBA DDR2 Interface
After analyzing the two designs, it was clear that we should reuse the DDR2 controller and
infrastructure from the BEE2 design rather than develop a new DDR2 controller from
scratch. The asynchronous FIFO design was also a convenient way to cross the AHB and DDR2
clock domains. As a result, the new module had to take AHB slave requests and convert them to
the appropriate memory commands for the DDR2 controller. It also had to take the DDR2
controller response and generate the appropriate AHB slave response. The GRLIB DDR
controller would serve as a basis for comparison with the new module in terms of interpreting
and generating AHB signals.
Figure 6: LEON3 System-on-Chip BEE2 Block Diagram
4.3.1 AMBA 2.0 AHB Specification
To understand the design of the AMBA DDR2 Interface module, it is necessary to
understand the AHB specification [1]. For this module, we are mainly concerned with the slave
request and response mechanism. What complicates the design is the pipelined nature of the bus
specification. Rather than a 2-cycle request/response mechanism, transactions are pipelined.
Requests are made while responses from previous requests are being returned. We provide a
sample AMBA timing diagram (Figure 7).
In the diagram, the “Control” signal corresponds to a read or write command with burst
length and data width. HREADY and HRDATA are slave responses to the command. The timing
as explained in the AMBA 2.0 technical documentation is rather confusing, so we try to explain
it more clearly from a design standpoint.
Figure 7: AMBA Read/Write Request Timing Diagram [1]
From the slave’s point of view, in the first interval, suppose a “read A” command is
interpreted from the Control and HADDR buses. This causes the slave to feed the input of the
HRDATA register with the data from A and the input of the HREADY register with HIGH
during the same interval. During the second interval, the output of HRDATA register eventually
becomes valid with the data from A, while the output for HREADY becomes HIGH. These
outputs are read by the master during this interval.
From the master’s point of view, in the first interval, HREADY was interpreted as valid
HIGH, acknowledging that the last command finished. Thus, in the same interval, the input of
HADDR is changed to B with the inputs of the Control changing as well for the new command
on B. If the command to address A was a write instead of a read, the input to HWDATA is
driven with the write data for address A. The outputs of these registers (command to B, and write
data for A) become valid during the second interval for the slave to read.
This example only shows the read or write of a single address. For the most part, the
GRLIB DDR controller operates with sequential bursts of reads and writes, usually
corresponding to the size of a cache line. A sequential burst simply increments the
address bus each cycle following the initial read or write command. Burst reading is complicated
in actual operation, however, by the delay when reading from memory, which necessitates the
insertion of wait states by the slave, by setting HREADY to LOW.
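The pipelined behavior above can be made concrete with a small cycle-level model (a sketch only: the slave here is a hypothetical single-word memory with a configurable read latency, not the GRLIB controller, and all names are ours):

```python
# Cycle-by-cycle model of AMBA AHB pipelining: the master drives the
# address phase of transfer N while the slave completes the data phase
# of transfer N-1. The slave inserts wait states (HREADY low) to model
# memory read latency. Hypothetical slave; for illustration only.

def run_ahb(transfers, read_latency=1):
    mem = {}
    log = []                 # (transfer in data phase, HREADY, HRDATA)
    addr_phase = None        # (op, addr, wdata) currently in address phase
    data_phase = None        # transfer currently in data phase
    wait = 0
    queue = list(transfers)
    while queue or addr_phase or data_phase:
        hready = True
        hrdata = None
        if data_phase:
            op, addr, wdata = data_phase
            if op == "read" and wait < read_latency:
                hready = False          # wait state: stall the bus
                wait += 1
            else:
                if op == "write":
                    mem[addr] = wdata
                else:
                    hrdata = mem.get(addr, 0)
                wait = 0
        log.append((data_phase, hready, hrdata))
        if hready:                      # bus advances only on HREADY high
            data_phase = addr_phase
            addr_phase = queue.pop(0) if queue else None
    return mem, log

mem, log = run_ahb([("write", 0x40000000, 0xAB), ("read", 0x40000000, None)])
```

Running this shows the write's data phase overlapping the read's address phase, and the read stalling one cycle before returning the written value.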
4.3.2 BEE2 Reference DDR2 Controller Command Interface
The BEE2 DDR2 Controller interface is much simpler than the AHB interface. The
master (the AMBA DDR2 Interface module) issues a command by setting a command valid bit,
the command type (read or write), a command identifier (tag), write data, and write byte enable
mask. The DDR2 controller acknowledges the command by setting a command
acknowledgement bit. This completes a write command. For a read command, the DDR2
controller will also set a read valid bit along with the command tag and data when the read data
is ready. The master will complete the read command by setting a read acknowledgement bit.
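This valid/ack protocol can be summarized with a small model (illustrative only: the real interface is a set of synchronous hardware signals, and the class and method names below are invented for the sketch):

```python
# Model of the BEE2 DDR2 controller command handshake described above:
# the master raises a valid bit with the command; the controller raises
# an ack; reads additionally return a (tag, data) pair with a read-valid
# bit that the master then acknowledges. Sketch only.

class DDR2ControllerModel:
    def __init__(self):
        self.mem = {}
        self.rd_valid = False
        self.rd_tag = None
        self.rd_data = None

    def command(self, rnw, addr, tag=0, wdata=None):
        # Returning True models mem_cmd_ack.
        if not rnw:                  # write: the ack completes the command
            self.mem[addr] = wdata
            return True
        # read: ack now; data presented later with rd_valid and the tag
        self.rd_valid = True
        self.rd_tag = tag
        self.rd_data = self.mem.get(addr, 0)
        return True

    def read_ack(self):
        # Master sets mem_rd_ack to complete the read.
        self.rd_valid = False

ctrl = DDR2ControllerModel()
assert ctrl.command(rnw=False, addr=0x100, wdata=0xDEAD)  # write acked
assert ctrl.command(rnw=True, addr=0x100, tag=7)          # read acked
assert ctrl.rd_valid and ctrl.rd_tag == 7 and ctrl.rd_data == 0xDEAD
ctrl.read_ack()
```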
4.3.3 AMBA DDR2 Interface Design
Given this background information we can now describe the AMBA DDR2 Interface
design. The design is a finite state machine with an idle state, three write states, and two read
states. The state descriptions are quite straightforward; however, as with most hardware designs,
timing is the major source of complexity.
All requests are handled in the IDLE state. In this state, incoming requests are interpreted,
initial steps for dealing with the request are taken, and a state transition is made to the first read
or first write state. As a note, the general action in each state is to change the inputs of
registers so that the outputs of the registers reflect these changes in the following cycle.
4.3.3.1 Read Requests
When a read request is made in the IDLE state, an AMBA wait state is inserted by setting
HREADY to LOW. The address is stored in a register (craddr) and a read command is issued to
the DDR2 controller for the entire cache line (32 bytes); thus, the last 5 bits of the address are
ignored. The next state is set to READ1.
In READ1, the module waits for the read command acknowledgement from the DDR2
controller. When this is received, the read command is reset, with the command valid bit set to
LOW so that the command is not issued twice. The module continues in this state until a read
valid is received from the DDR2 controller. When this is detected, HREADY is set to HIGH and
craddr is incremented. As can be seen from the diagram, craddr selects from the DDR2 read data
register to output on the AHB. Thus, the first data is returned in the next cycle. The next state is
set to READ2.
In READ2, the rest of the requested data is put on the AHB while the read burst
command is maintained. Finally, when the command has finished or the burst limit is reached
(bursts may not access the next cache line), a read acknowledge signal is sent to the DDR2
controller. HREADY is also set LOW so that the IDLE state can properly deal with the next
command. This last action could be a source of inefficiency as it should be possible to begin
processing the next command in the current state. However, design simplicity was favored over
adding complexity for a bit of performance gain.
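The address arithmetic used by these read states can be sketched in a few lines (illustrative Python; the helper names are ours, and the 8-words-per-line figure follows from the 32-byte cache line and the 32-bit AHB data width):

```python
# Address arithmetic for the read path described above (illustrative):
# reads always fetch a full 32-byte cache line, so the low 5 address
# bits are zeroed for the DDR2 command, and craddr[4:2] selects which
# 32-bit word of the returned line goes out on HRDATA each cycle.

LINE_BYTES = 32

def line_base(haddr):
    return haddr & ~(LINE_BYTES - 1)        # {haddr[31:5], 5'b0}

def word_select(craddr):
    return (craddr >> 2) & 0x7              # craddr[4:2], 8 words per line

def burst_words(haddr):
    # Word indices produced as craddr increments by 4, stopping at the
    # line end (the burst may not cross into the next cache line).
    craddr = haddr
    out = []
    while True:
        out.append(word_select(craddr))
        craddr += 4
        if (craddr & (LINE_BYTES - 1)) == 0:  # wrapped to next line
            break
    return out

assert line_base(0x40000014) == 0x40000000
assert word_select(0x40000014) == 5
assert burst_words(0x40000000) == [0, 1, 2, 3, 4, 5, 6, 7]
assert burst_words(0x40000018) == [6, 7]    # burst limited by line end
```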
Figure: AMBA DDR2 Interface read-request FSM (IDLE → READ1 → READ2). The diagram shows the
register-input assignments (n prefix = register input, c prefix = register output) for the memory
command signals (mem_cmd_valid, mem_cmd_addr, mem_cmd_rnw, mem_cmd_tag, mem_wr_be, mem_wr_din,
mem_rd_ack), the transition conditions (is_slave_target and hwrite = 0; mem_cmd_ack = 1;
mem_rd_valid = 1; ahb_stop or craddr[4:2] = 3'b000), and the craddr[4:2]-indexed MUX that selects
32-bit words of mem_rd_dout onto hrdata[31:0] from the DDR2 async FIFO. Shorthands:
ahb_stop <= hsel = '0' or htrans != SEQ; is_slave_target <= hready = 1 and hsel = 1 and
(htrans = SEQ or htrans = NONSEQ).
4.3.3.2 Write Requests
Unlike read requests made in the IDLE state, no AMBA wait state is inserted for writes.
Instead, the write data is written into a write buffer selected by the write address. A bit mask is
also updated to indicate which bytes of the 32-byte cache line are to be written to. The next state
is set to WRITE1.
In WRITE1, the rest of the write buffers are filled along with corresponding bit mask
updates until the write burst finishes. At this point, HREADY is set LOW and the next state is set
to WRITE2.
In the first cycle of WRITE2, the last write buffer has just been written with the last write
data. The command to write the data in the write buffers is then issued to the DDR2 controller
with 8 filler ECC bits interleaved every 8 bytes. The next state is set to WRITE3.
In WRITE3, the module waits for the write command acknowledgement from the DDR2
controller before resetting the memory command and byte enable registers and transitioning back
to IDLE.
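The byte-enable accumulation and the ECC-filler interleaving in WRITE2 can be sketched as follows (a sketch only: the function names are ours, and the exact bit ordering of the filler fields is an assumption based on the {0, cbe[31:24], …} pattern in the write FSM figure):

```python
# Sketch of how 256 bits of buffered write data are expanded to the
# 288-bit mem_wr_din by inserting 8 filler ECC bits before every 8 data
# bytes, with the 32-bit byte-enable mask expanded to 36 bits the same
# way. Bit layout is illustrative; the real ordering follows the BEE2
# controller netlist.

def expand_wdata(wdata_bytes):          # 32 data bytes -> 36 bytes
    assert len(wdata_bytes) == 32
    out = bytearray()
    for i in range(0, 32, 8):
        out += b"\x00"                  # 8 filler ECC bits
        out += wdata_bytes[i:i + 8]
    return bytes(out)

def expand_be(be32):                    # 32-bit mask -> 36-bit mask
    out = 0
    for group in range(4):
        chunk = (be32 >> (8 * group)) & 0xFF
        out |= chunk << (9 * group)     # one spare bit above each group
    return out

def maskdecode(haddr, hsize):
    # Byte-enable bits within the 32-byte line for one AHB beat;
    # hsize is log2 of the transfer size, as in the AHB spec.
    nbytes = 1 << hsize
    offset = haddr & 0x1F
    return ((1 << nbytes) - 1) << offset

assert len(expand_wdata(bytes(range(32)))) == 36
assert expand_be(0xFF) == 0xFF
assert expand_be(0xFF00) == 0xFF << 9
assert maskdecode(0x40000004, 2) == 0xF0   # one 4-byte beat at offset 4
```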
Figure: AMBA DDR2 Interface write-request FSM (IDLE → WRITE1 → WRITE2 → WRITE3). The diagram
shows the eight write buffers (write_buff1..write_buff8, enabled by cwritebuff_en via
decode(haddr)), the byte-enable accumulation (nbe := cbe | maskdecode(haddr, hsize)), the WRITE2
command issue (nmem_cmd_addr := {cwaddr[31:5], 5'b0}; nmem_wr_be := {0, cbe[31:24], …, 0,
cbe[7:0]}; nmem_wr_din := {8'h0, wdata[255:192], …, 8'h0, wdata[63:0]}), the transition
conditions (is_slave_target and hwrite = 1; ahb_stop or cwaddr[4:2] = 3'b111; mem_cmd_ack = 1),
and the full lists of AHB slave outputs and memory command outputs. Shorthands:
ahb_stop <= hsel = '0' or htrans != SEQ; is_slave_target <= hready = 1 and hsel = 1 and
(htrans = SEQ or htrans = NONSEQ).
4.4 Simulation and Testing
Although the background information and design are presented cleanly here, a significant
amount of simulation and testing had to be done to fully understand the AHB specification. For
this purpose, a simulation testbench was written to drive the memory interfaces with different
AHB commands and command sequences. The testbench was used both to study the GRLIB DDR
controller in order to drive the design of the AMBA DDR2 Interface, and to compare the two
modules for functional testing.
This simulation was done with Mentor Graphics ModelSim in conjunction with actual
testing of the design on the FPGA. A fully synthesized design of the LEON3 system with the
AMBA DDR2 Interface was implemented on the BEE2 board. A range of testing, from simple
reading and writing of memory through GRMON to full GDB-style debugging while booting
Snapgear Linux, was used to identify bugs in the design and devise better simulation test
cases.
Testing was completed when the LEON3 system could fully boot Snapgear-p34b and run
multithreaded versions of radix and fft. This version of Snapgear Linux fixed the Pthread library
issue we had discovered earlier.
Figure 8: LEON3 SPLASH-2 Performance. “init” runs take into account time for initialization code.
4.5 LEON3 System-on-Chip BEE2 Configuration
4 LEON3 processor cores with SMP
65MHz system clock (75MHz w/o hierarchy)
32-byte cache line (block)
4KB 2-way I-cache, LRU replacement policy
4KB direct-mapped D-cache
Memory management unit: split instruction and data TLBs, 8 entries each, LRU replacement policy
Instruction trace buffer
Hardware multiply-accumulate unit
200MHz DDR2 clock
Round-robin AMBA AHB arbitration policy
Figure 9: LEON3 SoC Xilinx Virtex2Pro70 Floorplan
4.6 LEON3 System-on-Chip BEE2 Synthesis and Resource Utilization
Resource Utilization      Used        Available   Utilization
LEON3 SoC BEE2 65
  Number of Slices        27,775      33,088      83.9%
  Number of 4-LUTs        49,071      66,176      74.2%
  Equivalent Gate Count   6,204,015
LEON3 SoC BEE2 75
  Number of Slices        27,221      33,088      82.3%
  Number of 4-LUTs        47,721      66,176      72.1%
  Equivalent Gate Count   6,199,374
LEON3 Single Core
  Number of Slices        9,225       33,088      27.9%
  Number of 4-LUTs        10,615      66,176      16.0%
  Equivalent Gate Count   1,015,990
Table 2: LEON3 BEE2 Resource Utilization
5. LEON3 for ASIC Implementation
Although the modifications made to the LEON3 system allowed implementation on the
BEE2 platform, the design was not ready for ASIC design synthesis. The system relies on
primitives specific to FPGA chips. These include I/O buffers, digital clock generators, on-chip
SRAM, and FIFOs. For ASIC synthesis, these primitives need to be replaced by corresponding
physical design library primitives. Compatibility will depend on the particular modules included
with the physical library.
For this project, we experimented with the libraries available to us: an NCSU SOI library for
the MIT Lincoln Labs 180nm technology and a UMC 90nm library. Although we had these
standard cell libraries, we did not have access to their respective memory compiler libraries.
Instead, we could only use an older memory compiler, Cadence CRETE, based on 180nm
technology, which does not provide timing information.
Even with this limitation, we still had to overcome the absence of an asynchronous
FIFO module in any of the libraries we were able to obtain, since these primitives are very
specific to the FPGA chip.
These two issues were the main obstacles for ASIC implementation; most other
primitives, like I/O buffers and clock buffers, can be inferred by the physical design tools. The
main tool we used was Magma Talus. We chose this program because it integrates the synthesis
and implementation steps of the design flow, letting us avoid the tool compatibility issues that
can arise when separate programs are used for different steps in the flow.
5.1 Physical Memory Conversion
As stated earlier, we did not have access to an implementation-specific physical memory
compiler that would give us complete memory specifications for the physical design flow.
However, we did have one that allowed us to run through the design flow and obtain a rough
area estimate for the chip.
To make use of the memory modules created with the memory compiler, additional
support was required in the source code. Fortunately, because the system is organized as a library,
memories are abstracted from the design and the type of memory used is configurable with a
VHDL generic variable. So, to use the physical memories created from the memory compiler, we
just had to write an interface layer and configure the variables properly. Since the interface for
physical memories in other libraries would likely be very similar to the interface from our
Cadence compiler, very minimal changes to source code would be required to support these
other memories.
With this modification made, we were able to run through the full physical design flow
for a single LEON3 core without other peripherals. We summarize our results for area of a single
LEON3 core as configured in the BEE2 in the following table:
LEON3 ASIC Area
Library          Technology   # Std Cells   Area (mm2)
NCSU             180nm        28,302        2.37
UMC              90nm         22,972        0.15
Cadence CRETE    180nm        -             4.15
Table 3: LEON3 ASIC Area Cost. NCSU and UMC show their respective area costs for logic; Cadence CRETE shows the cost of memory.
There are orders of magnitude of difference in area going from NCSU to UMC.
First, we notice a substantial ~18% decrease in standard logic cell usage for the
design. This can be explained by the larger selection of library primitives supported by the UMC
library, which allows more efficient logic mapping. If we assume the same number of logic
cells for NCSU, a rough calculation gives an expected area of 1.92mm2. This is still
12.8 times larger than the UMC area. With a factor-of-2 reduction in feature size from
NCSU to UMC we could optimally expect a 4 times reduction in area, which still does not
come close to the 12.8 times figure. We can explain this, however, by the reduction in cell
size that UMC implements over NCSU, probably through vastly more efficient cell design. The
cell height in UMC is 2.52µm while it is 12µm in NCSU; this difference alone would already
indicate a 4.7 times decrease in area. To get the 12.8 times reduction we would have to see an
average 2.7 times reduction in cell width, which does not seem unreasonable given the
reduction we see in cell height.
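The arithmetic behind this comparison can be checked directly (a back-of-the-envelope sketch using the Table 3 figures and taking the quoted cell heights of 2.52 and 12, presumably in microns):

```python
# Back-of-the-envelope check of the NCSU-vs-UMC area comparison above.
ncsu_cells, ncsu_area = 28302, 2.37     # 180nm library
umc_cells, umc_area = 22972, 0.15       # 90nm library

# Normalize the NCSU area to the UMC cell count.
ncsu_norm = ncsu_area * umc_cells / ncsu_cells
assert abs(ncsu_norm - 1.92) < 0.01             # ~1.92 mm2
assert abs(ncsu_norm / umc_area - 12.8) < 0.1   # ~12.8x gap remains

# Feature-size scaling alone (180nm -> 90nm) predicts only ~4x.
assert (180 / 90) ** 2 == 4.0

# Cell-height ratio accounts for ~4.8x; the rest must come from width.
height_ratio = 12 / 2.52                # cell heights, assumed in um
assert abs(ncsu_norm / umc_area / height_ratio - 2.7) < 0.1
```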
5.2 Asynchronous FIFO Conversion
The inclusion of the DDR2 controller for the BEE2 design necessitated the use of
asynchronous FIFOs in the FPGA implementation. Unfortunately, this was not included as part
of any of the standard cell libraries available to us, and it is rather unlikely that any of the
physical design libraries would include this type of module as part of their memory compilers.
As a result, quite a bit of modification had to be done to adapt the design for physical
synthesis. The main module affected again was the AMBA DDR2 Interface. This would not be
as simple as designing a FIFO using simple registers and memory libraries. The main issue here
is crossing the AMBA bus/DDR2 clock boundary. The asynchronous FIFOs in the FPGA design
dealt with this seamlessly; however, there is no equally elegant solution using only physical
design primitives. Dealing with this required additional logic and modification of the
original AMBA DDR2 Interface FSM.
5.2.1 AMBA DDR Physical Interface
This report will not go into as much detail as previously in the description of the original
interface. In all, three additional FSMs operating in the DDR2 clock domain are needed. One
handles sending the command (read, write, or idle) and address to the DDR2 controller, one
handles demultiplexing 32-byte writes into separate 16-byte writes across clock cycles, and the
third handles multiplexing 16-byte responses into a 32-byte response.
Two additional states, one for reading and one for writing, are added to the original
AMBA FSM to interact with the aforementioned DDR FSMs. The complexity here was in
dealing with the asynchronous handshaking between all FSMs so that there was no deadlock or
other incorrect behavior.
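The width conversion performed by the second and third FSMs amounts to the following (a minimal sketch; the beat ordering is an assumption):

```python
# Sketch of the width conversion handled by the DDR-domain FSMs above:
# a 32-byte AMBA-side write is demultiplexed into two 16-byte beats for
# the DDR2 controller, and two 16-byte read responses are multiplexed
# back into one 32-byte response. Low-half-first ordering is assumed.

def demux_write(line32):
    assert len(line32) == 32
    return [line32[:16], line32[16:]]   # two 16-byte write beats

def mux_read(beats):
    assert len(beats) == 2 and all(len(b) == 16 for b in beats)
    return beats[0] + beats[1]          # reassembled 32-byte response

line = bytes(range(32))
beats = demux_write(line)
assert mux_read(beats) == line          # round-trips losslessly
```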
The handshaking involves three state variables: ddrstart, ddrstart_ack, and ddrstartold.
The first two variables are in the AMBA clock domain while the last is in the DDR clock domain.
All variables are initialized to 0.
The AMBA FSM initiates a DDR command by toggling (inverting) ddrstart and then waits. This
initiates the DDR command FSM, which starts when ddrstart XOR ddrstartold = 1, copying
the memory command into registers in the DDR domain and sending the command to the DDR
controller. If it is a write, the DDR controller will signal a request for the separate 16-byte data;
this is handled by the write FSM and concludes the memory command in the DDR clock
domain. If it is a read, the controller will signal that data is ready to be read; the separate 16-byte
reads are handled by the read FSM, which concludes the read in the DDR clock domain. The read
and write FSMs both set ddrstartold to ddrstart when they complete. On the AMBA side, the
memory command resumes when ddrstartold XOR ddrstart_ack = 1. The FSM then sets
ddrstart_ack to ddrstartold and completes the AMBA bus transaction as in the original design.
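The toggle-based handshake can be modeled at the event level (a sketch; the method names are ours, and real hardware would additionally synchronize each bit through flip-flop chains before sampling it in the other domain):

```python
# Event-level model of the toggle handshake described above. ddrstart
# and ddrstart_ack live in the AMBA clock domain, ddrstartold in the
# DDR clock domain; single-bit toggles cross the boundary safely since
# only one bit changes per transaction. Sketch only.

class Handshake:
    def __init__(self):
        self.ddrstart = 0
        self.ddrstart_ack = 0
        self.ddrstartold = 0

    def amba_start(self):               # AMBA side: toggle to kick off
        self.ddrstart ^= 1

    def ddr_pending(self):              # DDR side: command pending?
        return self.ddrstart ^ self.ddrstartold == 1

    def ddr_complete(self):             # DDR side: FSM done, copy over
        self.ddrstartold = self.ddrstart

    def amba_done(self):                # AMBA side: may resume?
        return self.ddrstartold ^ self.ddrstart_ack == 1

    def amba_resume(self):              # AMBA side: acknowledge
        self.ddrstart_ack = self.ddrstartold

hs = Handshake()
for _ in range(3):                      # the protocol works repeatedly
    assert not hs.ddr_pending() and not hs.amba_done()
    hs.amba_start()
    assert hs.ddr_pending()
    hs.ddr_complete()
    assert not hs.ddr_pending() and hs.amba_done()
    hs.amba_resume()
    assert not hs.amba_done()
```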
This handshake mechanism follows a similar design to that found in the original GRLIB
DDR controller, but is adapted for use with separate FSMs.
5.3 Simulation and Testing
Since there was no documentation for the BEE2 DDR2 Controller interface without
asynchronous FIFOs, we relied on simulation to learn how to interact with it. Because we had a
working design using the asynchronous FIFOs, we could inspect the behavior of that design to
help design the new AMBA Interface. Unfortunately, this also became a source of error and
confusion.
Throughout design and testing, an effort was made to reproduce the FIFO waveforms as
they appear in the simulation environment. Although our design met this criterion, the first cycle
of write input to the DDR controller seemed to be missed in each case. When the design was
modified to get the correct functional behavior in simulation, this led to incorrect behavior when
the design was tested on the FPGA board. In the end, it was discovered that the simulation
environment used did not respect implicit delta delays in VHDL code when registers change
value. This meant that although the waveforms exactly matched those of the previous working
FIFO design (which used Xilinx netlists), internally the simulator was processing different
values for the signals. This necessitated inserting “after” clauses into signal assignments so that
the FPGA and simulation behavior would match.
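As a hypothetical illustration of this workaround (the actual signal names in the port differ), an explicit delay can be attached to a signal assignment:

```vhdl
-- Without the explicit delay, the simulator updated the register and
-- its readers in the same delta cycle, diverging from the behavior of
-- the netlist-based FIFO design.  A small inertial delay restores the
-- intended ordering in simulation and is ignored by synthesis.
-- (Hypothetical signal names.)
ddr_command <= next_command after 1 ns;
```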
Although at times we were able to run the new design on the FPGA, many times we
were unable to initialize the system after pushing the design onto the board. We were unable to
discover the source of this problem; however, it does not appear to be an issue with the
modifications made to the design. Rather, we believe it may stem from the FPGA synthesis tools
not placing the AMBA Interface close enough to the DDR2 Controller. The BEE2 DDR2
Controller is known to be very sensitive to placement (which is why the controller itself is
distributed in netlist form for implementation) because of the tight timings needed for the
DDR2 clocks on the relatively slower FPGA.
However, on the few occasions we were able to initialize the system on the FPGA, we were
able to verify when our simulation was out of sync, and eventually that the final design could
perform user-specified reads and writes through the DSU. Unfortunately, in these instances
the FPGA synthesis tools were unable to meet the timing requirements specified for the LEON3’s
system clock. This meant that when we tried to load the Snapgear Linux image, which executed
back-to-back memory operations at system speed, some operations would be lost. Thus, we were
unable to verify Linux functionality.
6. Conclusion
In this project we ported the open-source GRLIB LEON3 system-on-chip for
use on the BEE2 board, allowing the use of more cores than on the Xilinx XUP board and
enabling a faster DDR2 interface for the LEON. This port is also being used in a research project
at Princeton University that uses the LEON for CMP emulation.
After further modification, the system is now ready for physical design synthesis. We
have shown preliminary results for 2D synthesis of a single LEON3 core and see it as a viable
candidate for use as a real-world application driver in future exploration of 3D physical design
synthesis.
All sources for this project are made available to the public at the UCLA VLSI CAD Lab
website as a Software Release (http://cadlab.cs.ucla.edu/software_release/bee2leon3port).
7. References
1. AMBA Specification Rev. 2.0, http://www.arm.com/products/solutions/AMBA_Spec.html (http://www.gaisler.com/doc/amba.pdf)
2. BEE2 – Berkeley Emulation Engine 2, http://bee2.eecs.berkeley.edu/
3. BEE2wiki, http://bee2.eecs.berkeley.edu/wiki/BEE2wiki.html
4. Gaisler Research GRLIB IP Library, http://www.gaisler.com/cms/index.php?option=com_content&task=section&id=13&Itemid=125
5. GRLIB IP Cores Manual, http://www.gaisler.com/products/grlib/grlib.pdf
6. GRLIB User’s Manual, http://www.gaisler.com/products/grlib/grlib.pdf
7. GRMON 1.1.27b Manual, http://www.gaisler.com/doc/grmon.pdf
8. RAMP – Research Accelerator for Multiple Processors, http://ramp.eecs.berkeley.edu/
9. Snapgear for LEON manual, ftp://gaisler.com/gaisler.com/linux/linux-2.6/snapgear
10. Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. In Proceedings of the 22nd International Symposium on Computer Architecture, pages 24-36, Santa Margherita Ligure, Italy, June 1995 (http://www-flash.stanford.edu/apps/SPLASH/)
11. Xilinx XUP User Manual, http://www.xilinx.com/univ/XUPV2P/Documentation/XUPV2P_User_Guide.pdf