Evaluation of RISC-V RTL with FPGA-Accelerated Simulation · MIDAS Custom Compiler Passes 10...

Post on 21-Apr-2020

11 views 0 download

Transcript of Evaluation of RISC-V RTL with FPGA-Accelerated Simulation · MIDAS Custom Compiler Passes 10...

EvaluationofRISC-VRTLwithFPGA-AcceleratedSimulation

DonggyuKim,ChristopherCelio,DavidBiancolin,JonathanBachrach,KrsteAsanovic

CARRV201710/14/2017

BAR

EvaluationMethodologiesForComputerArchitectureResearch

2

Analytic Power/Energy Modeling

Microarchitectural Cycle-by-Cycle Software Simulators

Simulation Sampling

BAR

3

HowToDoComputerArchitectureResearchInThePast?

FewlinesofC/C++code

Paper

100M~1Binstructions

BAR

HowEasySingle-CyclePerfectCaches?

4

§ 10linesofC++codeinMARRSx86è Easytoimplementunrealistic,non-cycle-accuratemodels

BAR

5

NoµarchSimulationAnyMore!

§ Validationisdifficult- Isyourmodelingstillcycle-accurate?

§ Simulationistooslowà SimulationSampling

Validation of one design instance[1]

Other designs points?

Your target modeling?

[1] Gutierrez et al. Sources of error in full-system simulation, ISPASS 2014

BAR

6

NoSimulationSampling!

§ Phase-basedSampling(e.g.SimPoint)- Notheoreticallyguaranteederrorbounds- ShortperiodsofphasesshowsimilarIPCwheneverrepeated- 401.bzip2withitsreferenceinputinBOOM-2w

§ StatisticalSampling(e.g.SMARTS)- Staticallyboundederrors- Statewarmingproblems• µarchitecturalstateshouldberecoveredfromfunctionalsimulators

§ Whataboutmanaged-languageapplications?

BAR

7

BuildRTLToValidateYourDesignIdeas!

§ RTLisnolongerdifficult- Hardwareconstructionlanguages(e.g.Chisel)- RISC-Vimplementationcodebase(e.g.RocketChip)

§ RTLsimulationisnolongerslow- TensofMIPSusingFPGAs

FGPA-Accelerated Simulation

RTL Implementation

HardwareSpecification

CAD Tools

Area, Timing, Power Evaluation

PerformanceEvaluation

BAR

OldBOOMLayout

8

RegFile

ICache

Uncore

LSU

RenameTable

FPU

ROB

Free List

Issue Window

Branch PredictorALUs

Fetch Buffer

DCache

DCacheControl

IDIVIMUL Busy Table

Bypasses

BAR

§ Automatically transformsanyRTLdesignsintoFPGA-AcceleratedRTLsimulators

§ Quickly evaluatesperformance,power,andenergyofRTLdesignswithrealisticsoftwareapplications.

§ Flexibly Co-simulatesabstractHW&SWmodels

9

MIDAS:TurnYourRTLintoGold

Target RTL

BARMIDASCustomCompilerPasses

10

§ FIRRTL:IRforRTLtransformshttps://github.com/freechipsproject/firrtl

§ InstrumentRTLdesignsfor-Accurateperformancemodeling- EasysimulationcontrolinFPGA- Interactionswithabstracttimingmodels- RTLstatesnapshotsforenergymodeling

FIR

RTL

Com

pile

r

FPGA-Accelerated RTL Simulator

Target RTL Design

Macro MappingFAME1 Transform

Scan Chain InsertionSimulation MappingPlatform Mapping

BARMappingSimulationtotheFPGAHost

11

I/ODevices

ProcessorL2 Cache

/ MainMemory

FPGA Board( e.g. Xilinx Zynq)

Software FPGA

I/ODevices Processor

MemorySystemTiming

Simulation Driver

Board DRAM

L2 Cache /Main Memory

I/O Endpoints

§ SimulationSpeed:3.56MHz(ISCA`16)à ~40MHz(CARRV`17)§ I/OEndpoints- Low-leveltimingtokens<->High-leveltransactions- OptimizethecommunicationsbetweenSW&FPGA

§ L2/MainMemory- Abstracttimingmodel(L2$tags)inFPGA- ActualdatainBoardDRAM- Setsize,associativity,blocksize,latency:runtimeconfigurable

BARMemoryTimingModelValidation

12

§ Apointer-chaseµbenchmarkofccbench runninginBOOM(https://github.com/ucb-bar/ccbench)

§ L1$:16KiB,6cycles§ L2$:1MiB,6+23cycles (runtimeconfigurable)§ DRAM:6+23+80cycles(runtimeconfigurable)

BARTargetDesign:RocketChipGenerator

13

Rocket(In-order Processor)

BOOM-2w (Version 1)(Out-of-order Processor)

Fetch-width 1 2

Issue-width 1 3

Issue slots 20

ROB size 80

Ld/St entries 16/16

Physical registers 32(int)/32(fp) 110

Branch predictor gshare: 16KiB history

BTB entries 40 40

RAS entries 2 4

MSHR entreis 2 2

L1 I$ / D$ 16 KiB or 32 KiB

ITLB / DTLB reaches 128 KiB / 128 KiB

L2 $ 1MiB / 23 cycles

DRAM latency 80 cycles

BAR

§ InstructioncountwithRISC-Vforthereferenceinputs

§ InstructionCountAverage:2.08Trillion§ 445.gobmk,456.hmmer,462.libquantum:failinBOOM

SPEC2006intBenchmarkSuite

14

Benchmarks Instruction Count(T) Benchmarks Instruction Count(T)

400.perlbench 2.48 458.sjeng 2.85

401.bzip2 3.08 462.libquantum 2.09

403.gcc 1.37 464.h264ref 5.07

429.mcf 0.29 471.omnetpp 0.61

445.gobmk 2.04 473.astar 1.05

456.hmmer 2.95 483.xalanbmk 1.10

BARSimulationTimeforSPEC2006int

15

Simulators Speed Average Max(464.h264ref: 5T)

GEM5+Ruby 100 KIPS[1]240 days

(8 months)640 days

(1.8 years)

MARSSx86 400 KIPS[2]60 days

(2 months)160 days

(5.3 months)

MIDAS 18 MIPS (40MHz) 1.4 days 3 days

[1] http://gem5-users.gem5.narkive.com/hCJL6O2Q/gem5-simulation-speed[2] Patel et al. MARSSx86: A full system simulator for x86 CPUs, DAC 2011.

BARCaseStudy:IPC

16

0.0

0.2

0.4

0.6

0.8

1.0

1.2

IPC

Rocket 16KiB L1 Rocket 32KiB L1 BOOM-2w 16KiB L1 BOOM-2w 32KiB L1 Cortex A9

§ RocketiscomparabletoCortexA9§ BOOMoutperformsCortexA9§ Performanceimprovement:16KiBL1$à 32KiBL1$

1.41

BARCaseStudy:MPKIsofBOOM-2w

17

0

10

20

30

40

50

60

MPK

I

Conditional Branch Indirect Branch L1 I-Cache ITLB L1 D-Cache DTLB L2 Cache

§ 40BTBentries,32KiBL1$,1MiBL2$,TLBreach=128KiB§ 473.omnetpp,483.xalancbmk:LargeindirectbranchMPKIsà BiggerBTBs

BAR

0

20

40

60

80

100

Issu

e Sl

ots

/ Iss

ue W

idth

(%)

Issued Slots Empty Slots Non-issued Slots

CaseStudy:IssueQueueUtilization

18

§ IssuedSlots/Cycle=IPC§ EmptySlots:frontendhazards,lackofresources(403.gcc)§ Non-issuedslots:backendhazards

BAR

Full Program Execution In FPGA

StroberPower/EnergyModeling

19

RTL State Snapshots

I/O Traces

PrimeTime PX

RTL SimulationPost-synthesisDesignsRTL Signal Activities

Average Power

§ NostatewarmingisnecessaryinRTL/gate-levelsimulation

BAR

0

100

200

300

400

500

600Ro

cket

BOOM

-2w

Rock

et

BOOM

-2w

Rock

et

BOOM

-2w

Rock

et

BOOM

-2w

Rock

et

BOOM

-2w

Rock

et

BOOM

-2w

Rock

et

BOOM

-2w

Rock

et

BOOM

-2w

Rock

et

BOOM

-2w

400.perlbench 401.bzip2 403.gcc 429.mcf 458.sjeng 464.h264ref 471.omnetpp 473.astar 483.xalancbmk

Pow

er (m

W)

Misc

Uncore

L1 D-cache control

L1 D-cache meta + data

L1 I-cache

ROB

LSU

FPU

Integer Unit

Issue Logic

Register File

Rename + Control

Branch Predictior

Fetch Unit

CaseStudy:PowerBreakdown

20

§ Synopsys32nmeducationaltechnology§ 50randomsampleRTLstatesnapshots§ PipelineUtilization↑à Power↑§ BOOM:40%powerfromRenameLogic&RegisterFile

BARCaseStudy:EnergyPerInstruction

21

§ BOOM-2wisperformant,Rocketismoreenergyefficient§ 429.mcf,471.omnetpp:lowpower,lessenergyefficientPipelineUtilization↑à EnergyEfficiency↑

0

200

400

600

800

1000

EPI(p

J / I

nst)

Rocket 32 KiB L1 BOOM-2w 32 KiB L1

1923.4 1027.0

BAR

§ Debugging,validationforRTLdesigns-Assertiondetection-CommitlogcomparisonwithSpike

§ Memorysystemtimingmodeling-RealisticDRAMtimingmodelsinFPGA

§ FireSim:datacentersimulation(fires.im)

-ConnectRocketChips withNICs&switchtimingmodels-Amazonblogpost,7th RISC-Vworkshop

On-goingMIDASResearchInUCB

22

BARMIDAS(Strober)IsOpen-SourceNow

23

Your Processors /Accelerators

§ Examples:https://github.com/ucb-bar/midas-examples

§ RocketChip template:https://github.com/ucb-bar/midas-top-release

§ Willsupportvarioushostplatforms(XilinxZynq,AmazonEC2F1,IntelXeon+In-packageFPGA)

https://github.com/ucb-bar/midas-releasestrober.org

BARAcknowledgements

24

§ Funding:- DARPAAwardNumberHR0011-12-2-0016- TheCenterforFutureArchitectureResearch,amemberofSTARnet,aSemiconductorResearchCorporationprogramsponsoredbyMARCOandDARPA

- ASPIRELabindustrialsponsorsandaffiliates:Intel,Google,HPE,Huawei,LGE,Nokia,NVIDIA,Oracle,andSamsung

- Kwanjeong Scholarship