RISC processor implementation using Bluespec part 2 - final presentation

RISC PROCESSOR IMPLEMENTATION USING BLUESPEC

PART 2 - FINAL PRESENTATION

Performed By: Yahel Ben-Avraham and Yaron Rimmer
Instructor: Mony Orbach

Bi-semesterial, 2012 - 2014

30/3/2014

Project goals
Goal: Implementing and analyzing a RISC processor using Bluespec SystemVerilog
Part A:
- Studying the working environment, the BSV language and the basic processor implementation.
- Implementing a simple RISC processor.
- Running a simple test bench on the FPGA system.

Project goals
Goal: Implementing and analyzing a RISC processor using Bluespec SystemVerilog
Part B:
- Ramp up the design:
  - Wider instruction set
  - Branch prediction (and flushing)
  - Hazard detection unit and extended data forwarding
  - Performance counters
- Run the design on the FPGA system

Pipeline Datapath

[Figure: pipeline datapath. Stages: FETCH, DEC, EXE, MEM1, MEM2, WB. Other blocks: Instruction Memory, Register File, Memory, Forwarding, Branch Predictor.]

Fetch
- Tag the instruction’s metadata (PC, cycle)
- Fetch the requested instruction from the instruction memory
- Update the next PC:
  - Get the next PC’s branch prediction and branch address
  - Check for a Jump command

Decode
- Fully parse the received instruction
- Pre-fetch data from registers potentially in use

Execute
- According to the instruction’s opcode:
  - ALU instruction: compute the result
  - Memory instruction: calculate the memory address to read / write to
  - Branch instruction: check if the branch is taken and update the branch resolution
- Data forwarding

Memory 1
- Send a read / write request to the BRAM:
  - Write: data is immediately stored
  - Read: wait for the response in the next cycle
- Otherwise, pass the incoming data

Memory 2 (mem / skipmem)
- Implemented in two rules:
  - For a memory read: get the BRAM response
  - Otherwise, pass the incoming struct

Writeback
- Save needed data to the register file
  - Register 0 – read only
- Communication with the wrapper
  - Data and statistics
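The read-only register 0 can be illustrated with a small software model of the register file. This is only a sketch: the slides do not give the register count, the data width, or the value register 0 holds, so `NUM_REGS`, the 32-bit width, and the zero value are assumptions, and `regfile_t`, `rf_write` and `rf_read` are hypothetical names.

```c
#include <stdint.h>

#define NUM_REGS 32  /* assumption: the register count is not given in the slides */

/* Software model of the register file; zero-initialise it before use. */
typedef struct { uint32_t r[NUM_REGS]; } regfile_t;

/* Writeback: a write to register 0 (or to an out-of-range index) is dropped. */
void rf_write(regfile_t *rf, unsigned idx, uint32_t value) {
    if (idx != 0 && idx < NUM_REGS)
        rf->r[idx] = value;
}

/* Read a register; out-of-range indices read as 0 in this model. */
uint32_t rf_read(const regfile_t *rf, unsigned idx) {
    return (idx < NUM_REGS) ? rf->r[idx] : 0;
}
```

In this model, a write to register 0 leaves its content unchanged, while writes to any other register are visible to the next read.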

Branch Prediction
- 2-bit saturating local counter (initialized to WNT, weakly not taken)
- Prediction is acquired in the Fetch stage
  - Stored and passed along the pipeline
- Branch resolution is determined in the Exec stage
  - BP is updated accordingly
- Wrong prediction?
  - Correcting the PC
  - Flushing Dec & Exe
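The behavior of a 2-bit saturating counter can be sketched in C as a software model. This is the textbook scheme, not the project's BSV code; the state names and the functions `bp_predict`, `bp_update` and `bp_mispredicted` are hypothetical.

```c
#include <stdbool.h>

/* Counter states, ordered so that incrementing moves toward "taken". */
typedef enum { STRONG_NT = 0, WEAK_NT = 1, WEAK_T = 2, STRONG_T = 3 } bp_state_t;

/* Fetch stage: predict taken when the counter is in one of the two upper states. */
bool bp_predict(bp_state_t s) {
    return s >= WEAK_T;
}

/* Exec stage: move one step toward the resolved outcome, saturating at both ends. */
bp_state_t bp_update(bp_state_t s, bool taken) {
    if (taken)
        return (s == STRONG_T) ? STRONG_T : (bp_state_t)(s + 1);
    return (s == STRONG_NT) ? STRONG_NT : (bp_state_t)(s - 1);
}

/* A mismatch means the PC must be corrected and Dec and Exe flushed. */
bool bp_mispredicted(bool predicted_taken, bool taken) {
    return predicted_taken != taken;
}
```

Starting from `WEAK_NT`, a single taken branch is enough to flip the prediction, while a counter in a strong state needs two consecutive outcomes in the other direction.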

Forwarding
- 4 global forwarding registers
  - Each containing (when valid): address, value, cycle
- Writing: at the end of the Exec stage
- Reading: at the beginning of the Exec stage
- Invalidating: by aging after the Exec stage
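The write / read / age cycle of the forwarding registers can be sketched as a small C model. The slides give only the number of registers and their fields; the replacement policy (first invalid slot, else the oldest entry), the age limit `FWD_MAX_AGE`, and all names below are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define FWD_SLOTS   4  /* from the slide: 4 global forwarding registers */
#define FWD_MAX_AGE 3  /* assumption: number of cycles an entry stays valid */

typedef struct {
    bool     valid;
    uint8_t  addr;   /* destination register index */
    uint32_t value;
    uint32_t cycle;  /* cycle in which the value was produced */
} fwd_reg_t;

/* End of Exec: store a result in the first invalid slot, else overwrite the oldest entry. */
void fwd_write(fwd_reg_t f[FWD_SLOTS], uint8_t addr, uint32_t value, uint32_t cycle) {
    int slot = 0;
    for (int i = 0; i < FWD_SLOTS; ++i) {
        if (!f[i].valid) { slot = i; break; }
        if (f[i].cycle < f[slot].cycle) slot = i;
    }
    f[slot] = (fwd_reg_t){ true, addr, value, cycle };
}

/* Beginning of Exec: fetch the newest valid value for addr; returns false if none exists. */
bool fwd_read(const fwd_reg_t f[FWD_SLOTS], uint8_t addr, uint32_t *value) {
    bool found = false;
    uint32_t best = 0;
    for (int i = 0; i < FWD_SLOTS; ++i) {
        if (f[i].valid && f[i].addr == addr && (!found || f[i].cycle >= best)) {
            found = true;
            best = f[i].cycle;
            *value = f[i].value;
        }
    }
    return found;
}

/* After Exec: invalidate every entry older than FWD_MAX_AGE cycles. */
void fwd_age(fwd_reg_t f[FWD_SLOTS], uint32_t now) {
    for (int i = 0; i < FWD_SLOTS; ++i)
        if (f[i].valid && now - f[i].cycle > FWD_MAX_AGE)
            f[i].valid = false;
}
```

Keeping the cycle in each entry serves two purposes in this model: it lets a read pick the most recent producer of a register, and it lets aging discard values that should by then be available from the register file.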



Forwarding – cont.
- Special case: register read after memory load
  - Stalling registers hold the address of the register that the load will write to
  - If needed – stall the Exec stage by keeping the current command in the dec/exec FIFO
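The stall condition for a register read after a memory load can be sketched as follows. This is a simplified software model under assumptions: the slides do not say how many stalling registers exist or how they are cleared, and `pending_load_t` and `must_stall` are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stalling register: a load whose data has not yet returned from the BRAM. */
typedef struct {
    bool    valid;
    uint8_t dest;  /* register the load will write to */
} pending_load_t;

/* Exec must stall if either source operand is the destination of a pending load.
   In that case the instruction stays at the head of the dec/exec FIFO. */
bool must_stall(const pending_load_t *p, int n, uint8_t src1, uint8_t src2) {
    for (int i = 0; i < n; ++i)
        if (p[i].valid && (p[i].dest == src1 || p[i].dest == src2))
            return true;
    return false;
}
```

The forwarding registers cannot cover this case because the loaded value does not exist yet at the end of Exec; it only becomes available once the BRAM responds in the Memory 2 stage.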



The working environment
- Xilinx FPGA development board of the Virtex-5 family
  - Programming the FPGA using JTAG
  - Communication with the DUT using PCIE
- The platform enables:
  - Synthesis of the design to the FPGA
  - Reading and writing to memories
  - Performance counters

The platform

SCEMI’s working methods
- “Standard Co-Emulation Modeling Interface”
- 2 working methods:
  - TCP/IP simulation
  - FPGA emulation
- Establishes communication between a port on the SW end and a FIFO on the HW end
- Parcels (data structs) are delivered in both directions

System layers – PCIE emulation

[Figure: system layers. Labels: FPGA, SCEMI – DUT to PCIE, DUT: Wrapper, Datapath, PC, Linux O.S., C++ Executable: TB, Input files, PCIE.]

System layers – TCP/IP simulation

[Figure: system layers. Labels: FPGA, SCEMI – DUT to PCIE, DUT: Wrapper, Datapath, PC, Linux O.S., C++ Executable: TB, Input files, PCIE, DUT: Bsim_dut, TCP/IP.]

Our SCEMI platform – SW side
- A compiled C++ code (TB) is loaded with input files
- Sends messages to and receives messages from the DUT using incoming / outgoing ports
- We chose to use a “Stop & Wait” protocol
- Performs the following actions:
  - Loads the DUT’s instruction memory
  - Loads the DUT’s register file
  - Signals the DUT to run
  - When done, collects relevant information:
    - Register file
    - Run statistics
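The idea of a Stop & Wait exchange can be sketched in C: the sender transmits one message, then blocks until the other side replies before sending the next. This is not the SCE-MI API and not the project's TB code; the message kinds, the `msg_t` layout, the transport callback and the loopback stub are all hypothetical and exist only to make the control flow concrete.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical message kinds, mirroring the TB actions listed above. */
typedef enum { MSG_LOAD_IMEM, MSG_LOAD_REG, MSG_RUN, MSG_READ_RESULTS } msg_kind_t;

typedef struct {
    msg_kind_t kind;
    uint32_t   addr;
    uint32_t   data;
} msg_t;

/* Hypothetical transport: sends one message and blocks until the DUT replies.
   Returns true if the reply is an acknowledgement. */
typedef bool (*send_and_wait_fn)(void *link, const msg_t *m);

/* Stop & Wait: message i+1 is sent only after message i has been acknowledged.
   Returns the number of acknowledged messages. */
size_t stop_and_wait(void *link, send_and_wait_fn xfer, const msg_t *msgs, size_t n) {
    size_t done = 0;
    while (done < n && xfer(link, &msgs[done]))
        ++done;
    return done;
}

/* Loopback stub standing in for the DUT: counts messages and always acknowledges. */
bool loopback_xfer(void *link, const msg_t *m) {
    (void)m;
    ++*(size_t *)link;
    return true;
}
```

With at most one message in flight, the software side never has to buffer or reorder replies, at the cost of one round trip per message.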

Our SCEMI platform – HW side
- Our top-level module (the Wrapper, which is our DUT)
- Receives messages from and sends messages to the TB using FIFOs
- Contains the Datapath itself as a black box
- Performs commands from the TB:
  - Loads the instruction memory and the register file
  - Initializes all the registers and starts / stops the run of the datapath
- Receives data from the datapath (from the WB stage) and relays it back to the TB

Putting the design to the test
- As a concluding test, we wrote a Bubble Sort in assembly: it loads 10 unsorted numbers into the memory, sorts them using bubble sort and displays them in the register file.
- The code uses almost all of the instruction set, and practically every feature in the design.

    for (i = 0; i < length - 1; ++i) {
        for (j = 0; j < length - i - 1; ++j) {
            if (array[j] > array[j + 1]) {
                int tmp = array[j];
                array[j] = array[j + 1];
                array[j + 1] = tmp;
            }
        }
    }

Critical example – Bubble sort
- The program works successfully in the BSV simulation and the TCP/IP simulation.
- Results are incorrect in the PCIE emulation.

Critical example – Bubble sort

[Figure: FPGA result compared with the expected result (TCP/IP).]

Isolating the problem
- Trying to isolate the problem: store 4 numbers, and read them into the register file
  - 4 ADDI, 4 STORE, 4 LOAD
- Encountered unexplained yet repeating results
- This is only one of many debugging attempts

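Isolating the problem
- Expected result, consistent with simulation: [result figure not preserved]
- FPGA result: [result figure not preserved]
- Padding with 1 NOP between ADDI and ST: [result figure not preserved]
- Padding with 2 or more NOPs: [result figure not preserved]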

Further investigation
- Dismissing possible issues:
  - Design fault – works flawlessly in simulations
  - Clearing the design between runs
- Investigating Xilinx compilation files:
  - Place and route – margins are positive
  - No noteworthy warnings
- Consulting with Danny Hofshi, Mony Orbach, Yuval H.Nacson
- We were unable to solve the problem.

Problem characterization
- The FPGA differs in behavior from both the BSV and the TCP/IP simulations
- Related to the Store command – storing into the BRAM memory
- Occurs when performing multiple stores in a row
- Xilinx reports show no timing warnings

Project usage and integration
- The project is designed modularly, so that it can be easily modified and enhanced in the future
  - “Black Box” design
- Integration-oriented information and a step-by-step walkthrough for using the system are given in a designated section of the project’s final report

Summary and conclusions
- Fine line between high- and low-level implementation
- Easy to write, modify and understand
- Excellent simulation environment
- Differences between simulation and FPGA
- Automatic optimization – good and bad

THANK YOU!
