RISC processor implementation using Bluespec part 2 - final presentation
RISC PROCESSOR IMPLEMENTATION USING BLUESPEC
PART 2 - FINAL PRESENTATION
Performed by: Yahel Ben-Avraham and Yaron Rimmer
Instructor: Mony Orbach
Two-semester project, 2012 - 2014
30/3/2014
Project goals
Goal: implementing and analyzing a RISC processor using Bluespec SystemVerilog (BSV)
Part A:
- Studying the working environment, the BSV language and the basic processor implementation.
- Implementing a simple RISC processor.
- Running a simple test bench on the FPGA system.
Project goals
Goal: implementing and analyzing a RISC processor using Bluespec SystemVerilog (BSV)
Part B:
- Ramping up the design:
  - Wider instruction set
  - Branch prediction (and flushing)
  - Hazard detection unit and extended data forwarding
  - Performance counters
- Running the design on the FPGA system
Pipeline Datapath
[Figure: pipeline datapath. Stages: FETCH, DEC, EXE, MEM1, MEM2, WB. Blocks: Instruction Memory, Register File, Memory, Forwarding, Branch Predictor.]
Fetch
- Tag the instruction’s metadata (PC, cycle)
- Fetch the requested instruction from the instruction memory
- Update the next PC
  - Get the next PC’s branch prediction and branch address
  - Check for a Jump command
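The Fetch steps listed above can be sketched as a small behavioural model. This is only an illustration in Python, not the project's BSV code: the 4-byte instruction width, the priority of a jump over a predicted branch, and the helper callables `predict`, `is_jump` and `jump_target` are all assumptions made for the example.

```python
INSTR_BYTES = 4  # assumed instruction width in bytes; not given in the slides


def fetch(pc, cycle, imem, predict, is_jump, jump_target):
    """Return (tagged instruction, next PC) for one Fetch step.

    predict(pc) -> (taken, target), is_jump(word) and jump_target(pc, word)
    are hypothetical helpers supplied by the caller.
    """
    word = imem[pc // INSTR_BYTES]
    taken, target = predict(pc)
    # Tag the instruction with its metadata so later stages can use it.
    tagged = {"pc": pc, "cycle": cycle, "word": word, "pred_taken": taken}
    if is_jump(word):
        next_pc = jump_target(pc, word)   # assumed: a jump always redirects
    elif taken:
        next_pc = target                  # follow the predicted branch
    else:
        next_pc = pc + INSTR_BYTES        # fall through to the next word
    return tagged, next_pc
```

Passing the predictor and jump decoder as callables keeps the sketch independent of any particular instruction encoding.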
Decode
- Fully parse the received instruction
- Pre-fetch data from registers potentially in use
Execute
- According to the instruction’s opcode:
  - ALU instruction: compute the result
  - Memory instruction: calculate the memory address to read from / write to
  - Branch instruction: check if the branch is taken and update the branch resolution
- Data forwarding
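The opcode dispatch described for the Execute stage can be illustrated with a short Python sketch. The opcode names (`add`, `sub`, `and`, `or`, `load`, `store`, `beq`), the 32-bit width and the dictionary-based instruction format are hypothetical; the slides do not list the actual instruction set.

```python
MASK = 0xFFFFFFFF  # assumed 32-bit datapath

# Hypothetical ALU opcodes; the real instruction set is not listed in the slides.
ALU_OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "and": lambda a, b: a & b,
    "or": lambda a, b: a | b,
}


def execute(instr, a, b):
    """Dispatch on the opcode; a and b are the (possibly forwarded) operands."""
    op = instr["op"]
    if op in ALU_OPS:
        # ALU instruction: compute the result.
        return {"kind": "alu", "result": ALU_OPS[op](a, b) & MASK}
    if op in ("load", "store"):
        # Memory instruction: compute the address as base plus immediate.
        return {"kind": "mem", "addr": (a + instr["imm"]) & MASK}
    if op == "beq":
        # Branch instruction: resolve it and compare with the prediction.
        taken = a == b
        return {"kind": "branch", "taken": taken,
                "mispredicted": taken != instr["pred_taken"]}
    raise ValueError(f"unknown opcode: {op}")
```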
Memory 1
- Send a read / write request to the BRAM
  - Write: data is immediately stored
  - Read: wait for the response in the next cycle
- Otherwise, pass the incoming data
Memory 2 (mem / skipmem)
- Implemented in two rules:
  - For a memory read: get the BRAM response
  - Otherwise, pass the incoming struct
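The split between the two memory stages can be modelled roughly as follows. This Python sketch only mimics the request/response behaviour described above; the `Bram` class, its method names and the dictionary instruction format are assumptions for illustration and are not the BSV BRAM interface.

```python
from collections import deque


class Bram:
    """Toy memory: writes land at once, reads are answered one step later."""

    def __init__(self, size=256):
        self.data = [0] * size
        self.responses = deque()  # read data waiting to be collected

    def request(self, write, addr, value=0):
        if write:
            self.data[addr] = value
        else:
            self.responses.append(self.data[addr])

    def response(self):
        return self.responses.popleft()


def mem1(bram, instr):
    """Memory 1: issue the request, or pass non-memory instructions through."""
    if instr["op"] == "store":
        bram.request(True, instr["addr"], instr["value"])
    elif instr["op"] == "load":
        bram.request(False, instr["addr"])
    return instr


def mem2(bram, instr):
    """Memory 2: the 'mem' case collects read data, 'skipmem' passes through."""
    if instr["op"] == "load":
        return dict(instr, value=bram.response())
    return instr
```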
Writeback
- Save needed data to the register file
  - Register 0 – read only
- Communication with the wrapper
  - Data and statistics
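A read-only register 0 can be modelled by simply ignoring writes to it. The Python sketch below assumes 32 registers of 32 bits and assumes register 0 reads as zero, none of which is stated in the slides; the `retired` counter is a hypothetical stand-in for the statistics passed to the wrapper.

```python
class RegisterFile:
    def __init__(self, num_regs=32):      # register count is an assumption
        self.regs = [0] * num_regs

    def read(self, idx):
        return self.regs[idx]

    def write(self, idx, value):
        """Store value unless idx is register 0, which is read only."""
        if idx == 0:
            return False
        self.regs[idx] = value & 0xFFFFFFFF  # assumed 32-bit registers
        return True


def writeback(rf, instr, stats):
    """Commit one instruction and update a hypothetical retired counter."""
    stats["retired"] = stats.get("retired", 0) + 1
    if instr.get("dest") is not None:
        rf.write(instr["dest"], instr["value"])
```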
Branch Prediction
- 2-bit saturating local counter (initialized to WNT, weakly not taken)
- Prediction is acquired in the Fetch stage
  - Stored and passed along the pipeline
- Branch resolution is determined in the Exec stage
  - BP is updated accordingly
- Wrong prediction?
  - Correcting the PC
  - Flushing Dec & Exe
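A 2-bit saturating counter predictor of the kind described above can be sketched in a few lines of Python. The table size, the PC indexing and the function names are assumptions for illustration; only the 2-bit counter, the WNT initial state and the flush on a wrong prediction come from the slides.

```python
# Counter states: 0 = strongly not taken, 1 = weakly not taken (WNT),
#                 2 = weakly taken,       3 = strongly taken.
WNT = 1


class TwoBitPredictor:
    def __init__(self, entries=16):
        # Table size and indexing scheme are assumptions, not from the slides.
        self.entries = entries
        self.table = [WNT] * entries

    def _index(self, pc):
        # Assume word-aligned 4-byte instructions: drop the two low PC bits.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        """Return True when the branch at pc is predicted taken."""
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        """Move the counter one step toward the outcome, saturating at 0 and 3."""
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


def resolve(predictor, pc, predicted_taken, actually_taken):
    """Exec-stage resolution: update the counter and report a misprediction."""
    predictor.update(pc, actually_taken)
    return predicted_taken != actually_taken  # True: correct the PC and flush
```

Starting from WNT, a single taken outcome is enough to flip the prediction, while a strongly taken counter needs two not-taken outcomes before it predicts not taken again.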
Forwarding
- 4 global forwarding registers
  - Each containing (when valid) address, value, cycle
  - Writing: at the end of the Exec stage
  - Reading: at the beginning of the Exec stage
  - Invalidating: by aging after the Exec stage
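The write, read and aging behaviour of the forwarding registers can be sketched as below. This is a Python model under stated assumptions: the lifetime `MAX_AGE`, the replacement of the oldest slot when all four are full, and the choice of the newest matching entry on a read are guesses made for the example, since the slides give only the entry fields and the timing of each operation.

```python
from dataclasses import dataclass
from typing import List, Optional

NUM_FWD = 4   # four global forwarding registers, as stated on the slide
MAX_AGE = 4   # assumed lifetime in cycles; the slides give no number


@dataclass
class FwdEntry:
    addr: int    # destination register index
    value: int   # result produced by Exec
    cycle: int   # cycle in which the entry was written


class ForwardingUnit:
    def __init__(self):
        self.slots: List[Optional[FwdEntry]] = [None] * NUM_FWD

    def read(self, addr, regfile_value):
        """Start of Exec: prefer the newest valid entry over the register file."""
        hits = [e for e in self.slots if e is not None and e.addr == addr]
        if not hits:
            return regfile_value
        return max(hits, key=lambda e: e.cycle).value

    def write(self, addr, value, cycle):
        """End of Exec: record the result in a free slot, else the oldest one."""
        if addr == 0:
            return  # register 0 is read only, so nothing is forwarded
        for i, e in enumerate(self.slots):
            if e is None:
                self.slots[i] = FwdEntry(addr, value, cycle)
                return
        oldest = min(range(NUM_FWD), key=lambda i: self.slots[i].cycle)
        self.slots[oldest] = FwdEntry(addr, value, cycle)

    def age(self, now):
        """After Exec: invalidate entries that are MAX_AGE cycles old or more."""
        self.slots = [e if e is not None and now - e.cycle < MAX_AGE else None
                      for e in self.slots]
```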
Forwarding – cont.
- Special case: register read after memory load
  - Stalling registers hold the destination register address of the pending load
  - If needed – stall the Exec stage by keeping the current command in the dec/exec FIFO
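The load-use stall can be illustrated with a queue whose head is only removed when the instruction may proceed. The Python sketch below assumes a load latency of two cycles and a dictionary instruction format; both are hypothetical, and the class and function names are invented for the example.

```python
from collections import deque


class LoadUseStall:
    def __init__(self, load_latency=2):
        # Assumed: load data becomes usable load_latency cycles after Exec.
        self.load_latency = load_latency
        self.pending = {}  # destination register -> cycle when data is ready

    def note_load(self, dest, cycle):
        if dest != 0:
            self.pending[dest] = cycle + self.load_latency

    def must_stall(self, sources, cycle):
        # Drop entries whose data has arrived, then check the source registers.
        self.pending = {r: t for r, t in self.pending.items() if t > cycle}
        return any(r in self.pending for r in sources)


def exec_step(dec_exec_fifo, stall, cycle):
    """Run one Exec cycle; return the executed instruction, or None on a stall."""
    if not dec_exec_fifo:
        return None
    instr = dec_exec_fifo[0]       # peek only, the instruction stays queued
    if stall.must_stall(instr["srcs"], cycle):
        return None                # stall: keep the command at the FIFO head
    dec_exec_fifo.popleft()
    if instr["op"] == "load":
        stall.note_load(instr["dest"], cycle)
    return instr
```

Because a stalled instruction is never dequeued, the earlier stages are held back by the full FIFO without any extra control signal in this model.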
The working environment
- Xilinx FPGA development board of the Virtex 5 family
  - Programming the FPGA using JTAG
  - Communication with the DUT using PCIE
- The platform enables:
  - Synthesis of the design to the FPGA
  - Reading and writing to memories
  - Performance counters
The platform
SCEMI’s working methods
- “Standard Co-Emulation Modeling Interface”
- 2 working methods:
  - TCP/IP simulation
  - FPGA emulation
- Establishes communication between a port on the SW end and a FIFO on the HW end
- Parcels (data structs) are delivered in both directions
System layers – PCIE emulation
[Figure: system layers. Labels: FPGA; SCEMI – DUT to PCIE; DUT: Wrapper; Datapath; PC; Linux O.S.; C++ Executable: TB; Input files; PCIE.]
System layers – TCP/IP simulation
[Figure: system layers. Labels: FPGA; SCEMI – DUT to PCIE; DUT: Wrapper; Datapath; PC; Linux O.S.; C++ Executable: TB; Input files; PCIE; DUT: Bsim_dut; TCP/IP.]
Our SCEMI platform – SW side
- A compiled C++ program (TB) is loaded with input files
- Sends and receives messages from the DUT using incoming / outgoing ports
- We chose to use a “Stop & Wait” protocol
- Performs the following actions:
  - Loads the DUT’s instruction memory
  - Loads the DUT’s register file
  - Signals the DUT to run
  - When done, collects the relevant information:
    - Register file
    - Run statistics
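In a stop-and-wait exchange, the sender transmits one message and blocks until it is acknowledged before sending the next. The Python sketch below illustrates that pattern for the load and run sequence listed above. The message names (`LOAD_IMEM`, `LOAD_REG`, `RUN`, `ACK`) and the `FakeDut` stand-in are hypothetical; the actual test bench is a C++ program talking to the Wrapper through SCEMI ports.

```python
from collections import deque


class FakeDut:
    """Stand-in for the Wrapper: handles one message, answers with one ACK."""

    def __init__(self):
        self.imem = []
        self.regs = [0] * 32      # register count is an assumption
        self.running = False
        self.inbox = deque()
        self.outbox = deque()

    def step(self):
        if not self.inbox:
            return
        kind, payload = self.inbox.popleft()
        if kind == "LOAD_IMEM":
            self.imem.append(payload)
        elif kind == "LOAD_REG":
            idx, value = payload
            self.regs[idx] = value
        elif kind == "RUN":
            self.running = True
        self.outbox.append(("ACK", kind))


def send_and_wait(dut, msg):
    """Stop and wait: send one message, then block until its reply arrives."""
    dut.inbox.append(msg)
    while not dut.outbox:
        dut.step()
    return dut.outbox.popleft()


def run_testbench(dut, program, init_regs):
    """Load the instruction memory and register file, then start the run."""
    log = []
    for word in program:
        log.append(send_and_wait(dut, ("LOAD_IMEM", word)))
    for idx, value in init_regs.items():
        log.append(send_and_wait(dut, ("LOAD_REG", (idx, value))))
    log.append(send_and_wait(dut, ("RUN", None)))
    return log
```

With only one message in flight at a time, the two sides never need buffering deeper than one entry, at the cost of one round trip per message.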
Our SCEMI platform – HW side
- Our top-level module (the Wrapper, which is our DUT)
- Receives and sends messages to the TB using FIFOs
- Contains the Datapath itself as a black box
- Performs commands from the TB:
  - Loads the instruction memory and the register file
  - Initializes all the registers and starts / stops the run of the datapath
- Receives data from the datapath (from the WB stage) and relays it back to the TB
Putting the design to the test
- As a concluding test, we wrote a Bubble Sort in assembly: it loads 10 unsorted numbers into the memory, sorts them using bubble sort and displays them in the register file.
- The code uses almost all of the instruction set, and practically every feature in the design.
for (i = 0; i < length - 1; ++i) {
    for (j = 0; j < length - i - 1; ++j) {
        if (array[j] > array[j + 1]) {
            int tmp = array[j];
            array[j] = array[j + 1];
            array[j + 1] = tmp;
        }
    }
}
Critical example – Bubble sort
- The program works successfully in the BSV simulation and the TCP/IP simulation.
- Results are incorrect in the PCIE emulation.
Critical example – Bubble sort
[Figures: FPGA result; expected result (TCP/IP)]
Isolating the problem
- Trying to isolate the problem: store 4 numbers, and read them into the register file
  - 4 ADDI, 4 STORE, 4 LOAD
- Encountered unexplained yet repeating results
- This is only one of many debugging attempts
Isolating the problem
- Expected result: consistent with simulation
- FPGA result:
- Padding with 1 NOP between ADDI and ST:
- Padding with 2 or more NOPs:
[Figures: results for each case]
Further investigation
- Dismissing possible issues:
  - Design fault – works flawlessly in simulations
  - Clearing the design between runs
- Investigating Xilinx compilation files:
  - Place and route – margins are positive
  - No noteworthy warnings
- Consulting with Danny Hofshi, Mony Orbach, Yuval H. Nacson
- We were unable to solve the problem.
Problem characterization
- The FPGA differs in behavior from both the BSV and TCP/IP simulations
- Related to the Store command: storing into the BRAM memory
- Occurs when performing multiple stores in a row
- Xilinx reports show no timing warnings
Project usage and integration
- The project is designed modularly, so that it can be easily modified and enhanced in the future
  - “Black Box” design
- Integration-oriented information and a step-by-step walkthrough for using the system are in a designated section of the project’s final report
Summary and conclusions
- Fine line between high- and low-level implementation
- Easy to write, modify and understand
- Excellent simulation environment
- Differences between simulation and FPGA
- Automatic optimization – good and bad
THANK YOU!