Instruction Sets Want to be Free! - RISC-V

Krste Asanovic

UC Berkeley, RISC-V Foundation, & [email protected]

www.riscv.org

Shanghai Jiao Tong University, May 8, 2017

Why Instruction Set Architecture matters

§ Why can’t Intel sell mobile chips?
  - 99%+ of mobile phones/tablets based on ARM v7/v8 ISA

§ Why can’t ARM partners sell servers?
  - 99%+ of laptops/desktops/servers based on AMD64 ISA (over 95%+ built by Intel)

§ How can IBM still sell mainframes?
  - IBM 360, oldest surviving ISA (50+ years)

ISA is most important interface in computer system where software meets hardware

Open Software/Standards Work!

§ Why not successful free & open standards and free & open implementations, like other fields?

§ Dominant proprietary ISAs are not great designs

Field       Standard          Free, Open Impl.    Proprietary Impl.
Networking  Ethernet, TCP/IP  Many                Many
OS          Posix             Linux, FreeBSD      M/S Windows
Compilers   C                 gcc, LLVM           Intel icc, ARMcc
Databases   SQL               MySQL, PostgreSQL   Oracle 12C, M/S DB2
Graphics    OpenGL            Mesa3D              M/S DirectX
ISA         ??????            -----------         x86, ARM, IBM360

What is RISC-V?

§ Fifth generation of RISC design from UC Berkeley

§ A high-quality, license-free, royalty-free RISC ISA specification

§ Experiencing rapid uptake in both industry and academia

§ Standard maintained by non-profit RISC-V Foundation

§ Both proprietary and open-source core implementations

§ Supported by growing shared software ecosystem

§ Appropriate for all levels of computing system, from microcontrollers to supercomputers

RISC-V Origins

§ In 2010, after many years and many projects using MIPS, SPARC, and x86 as basis of research at Berkeley, time to choose ISA for next set of projects

§ Obvious choices: x86 and ARM

Intel x86 “AAA” Instruction

§ ASCII Adjust After Addition

§ AL register is default source and destination

§ If the low nibble is >9 decimal, or the auxiliary carry flag AF=1, then
  - Add 6 to low nibble of AL and discard overflow
  - Increment AH (the high byte of AX)
  - Set CF and AF

§ Else
  - CF = AF = 0

§ Single byte instruction
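The adjustment rule above can be sketched in a few lines of Python. This is only an illustrative model of the behavior as the slide describes it, not an x86 emulator: the function name `aaa` and its return tuple are assumptions made for this sketch, and clearing the high nibble of AL is assumed from the "discard overflow" wording.

```python
def aaa(al, ah, af):
    """Model of an AAA-style adjustment.

    al, ah: 8-bit register values; af: auxiliary carry flag on entry.
    Returns (al, ah, cf, af) after the adjustment.
    """
    low = al & 0x0F
    if low > 9 or af:
        low = (low + 6) & 0x0F   # add 6 to the low nibble, discard overflow
        ah = (ah + 1) & 0xFF     # carry into the next decimal digit
        cf = af = True
    else:
        cf = af = False
    # Assumption: only the low nibble of AL survives (one unpacked BCD digit).
    return low, ah, cf, af
```

For example, `aaa(0x0F, 0x00, False)` models adjusting the binary sum 9 + 6 = 15 into the unpacked decimal digits 1 (in AH) and 5 (in AL).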

ARM v7 LDMIAEQ Instruction

LDMIAEQ SP!, {R4-R7, PC}

§ LoaD Multiple, Increment-Address

§ Writes to 6 registers (R4-R7, PC, and SP) from 5 loads

§ Only executes if EQ condition code is set

§ Writes to the PC (a conditional branch)

§ Can change instruction sets

§ Idiom for "stack pop and return from a function call"

RISC-V Origin Story

§ x86 impossible – IP issues, too complex

§ ARM mostly impossible – no 64-bit, IP issues, complex

§ So we started “3-month project” in summer 2010 to develop our own clean-slate ISA
  - Andrew Waterman, Yunsup Lee, Dave Patterson, Krste Asanovic principal designers

§ Four years later, we released frozen base user spec
  - First public specification released in May 2011
  - Many tapeouts and several publications along the way

Why are outsiders complaining about changes to RISC-V in Berkeley classes?

Why is name RISC-V? (pronounced “risk-five”)

RISC-I

SOAR (aka RISC-III)

SPUR (aka RISC-IV)

RISC-V (Raven-1, 28nm FDSOI, 2011)

Universal ISA Requirements

§ Works well with existing software stacks, languages

§ Is native hardware ISA, not virtual machine/ANDF

§ Suits all sizes of processor, from smallest microcontroller to largest supercomputer

§ Suits all implementation technologies, FPGA, ASIC, full-custom, future device technologies…

§ Efficient for all microarchitecture styles: microcoded, in-order, decoupled, out-of-order, single-issue, superscalar, …

§ Supports extensive specialization to act as base for customized accelerators

§ Stable: not changing, not disappearing

Why Didn’t Other Open ISAs Take Off?

▪ SPARC V8 - To its credit, Sun Microsystems made SPARC V8 an IEEE standard in 1994
  - Sun, Gaisler offered open-source cores
  - ISA now owned by Oracle

▪ OpenRISC - GNU open-source effort started in 1999, based on DLX from Computer Architecture: AQA

▪ 64-bit ISA was in progress in 2010

▪ Didn’t separate Architecture and Implementation

▪ Competing in microprocessor era – now in SoC era

▪ Don’t meet the needs of a universal ISA

What’s Different about RISC-V?

§ Simple
  - Far smaller than other commercial ISAs

§ Clean-slate design
  - Clear separation between user and privileged ISA
  - Avoids µarchitecture or technology-dependent features

§ A modular ISA
  - Small standard base ISA
  - Multiple standard extensions

§ Designed for extensibility/specialization
  - Variable-length instruction encoding
  - Vast opcode space available for instruction-set extensions

§ Stable
  - Base and standard extensions are frozen
  - Additions via optional extensions, not new versions

RISC-V Base Plus Standard Extensions

§ Four base integer ISAs
  - RV32E, RV32I, RV64I, RV128I
  - RV32E is 16-register subset of RV32I
  - Only <50 hardware instructions needed for base

§ Standard extensions
  - M: Integer multiply/divide
  - A: Atomic memory operations (AMOs + LR/SC)
  - F: Single-precision floating-point
  - D: Double-precision floating-point
  - G = IMAFD, “General-purpose” ISA
  - Q: Quad-precision floating-point

§ All the above are a fairly standard RISC encoding in a fixed 32-bit instruction format

§ Above user-level ISA components frozen in 2014
  - Supported forever after

RISC-V Standard Base ISA Details

▪ 32-bit fixed-width, naturally aligned instructions

▪ 31 integer registers x1-x31, plus x0 zero register

▪ rd/rs1/rs2 in fixed location, no implicit registers

▪ Immediate field (instr[31]) always sign-extended

▪ Floating-point adds f0-f31 registers plus FP CSR, also fused mul-add four-register format

▪ Designed to support PIC and dynamic linking
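To illustrate why fixed register locations and a fixed sign bit simplify decoding, here is a minimal Python sketch that pulls the common fields out of a 32-bit base instruction. The bit positions (rd in bits 11:7, rs1 in 19:15, rs2 in 24:20, I-type immediate in 31:20) come from the published RISC-V base encoding rather than from the slide, and the helper names `sign_extend` and `decode_fields` are hypothetical.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)

def decode_fields(instr):
    """Extract the fixed-position fields of a 32-bit base instruction."""
    return {
        "opcode": instr & 0x7F,
        "rd": (instr >> 7) & 0x1F,     # always bits 11:7
        "rs1": (instr >> 15) & 0x1F,   # always bits 19:15
        "rs2": (instr >> 20) & 0x1F,   # always bits 24:20
        # I-type immediate: bits 31:20, with the sign bit at instr[31]
        "imm_i": sign_extend(instr >> 20, 12),
    }
```

Because the register fields never move, a decoder can start reading the register file before it knows which instruction format it is looking at; `decode_fields(0xFFF10093)` (the encoding of `addi x1, x2, -1`) yields rd = 1, rs1 = 2, and an immediate of -1.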

“A”: Atomic Operations Extension

Two classes:

▪ Atomic Memory Operations (AMO)
  - Fetch-and-op, op = ADD, OR, XOR, MAX, MIN, MAXU, MINU

▪ Load-Reserved/Store Conditional
  - With forward progress guarantee for short sequences

▪ All atomic operations can be annotated with two bits (Acquire/Release) to implement release consistency or sequential consistency

▪ Current issues in memory model being resolved, will be stronger than pure relaxed model.
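The fetch-and-op behavior of an AMO can be modeled roughly as follows. This is a software sketch assuming 32-bit words, with a lock standing in for the atomicity that hardware would provide; the class `WordMemory`, the table `AMO_OPS`, and the helper `to_signed32` are illustrative names, and the acquire/release ordering bits are not modeled.

```python
import threading

MASK32 = 0xFFFFFFFF

def to_signed32(x):
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

# The operations listed on the slide, acting on 32-bit patterns.
AMO_OPS = {
    "ADD": lambda old, v: (old + v) & MASK32,
    "OR": lambda old, v: old | v,
    "XOR": lambda old, v: old ^ v,
    "MAX": lambda old, v: old if to_signed32(old) >= to_signed32(v) else v,
    "MIN": lambda old, v: old if to_signed32(old) <= to_signed32(v) else v,
    "MAXU": lambda old, v: max(old, v),
    "MINU": lambda old, v: min(old, v),
}

class WordMemory:
    """Toy word-addressed memory with atomic fetch-and-op."""

    def __init__(self):
        self._words = {}
        self._lock = threading.Lock()  # stands in for hardware atomicity

    def amo(self, op, addr, value):
        """Atomically apply op to the word at addr; return the old value."""
        value &= MASK32
        with self._lock:
            old = self._words.get(addr, 0)
            self._words[addr] = AMO_OPS[op](old, value)
            return old

    def load(self, addr):
        with self._lock:
            return self._words.get(addr, 0)
```

The signed and unsigned variants differ only in how the bit patterns are compared: with 0 in memory, a MAX against 0xFFFFFFFF leaves 0 in place (the operand is -1 when signed), while MAXU stores 0xFFFFFFFF.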

Variable-Length Encoding

§ Extensions can use any multiple of 16 bits as instruction length

§ Branches/Jumps target 16-bit boundaries even in fixed 32-bit base
  - Consumes 1 extra bit of jump/branch address

                              xxxxxxxxxxxxxxaa   16-bit (aa ≠ 11)

             xxxxxxxxxxxxxxxx xxxxxxxxxxxbbb11   32-bit (bbb ≠ 111)

     ···xxxx xxxxxxxxxxxxxxxx xxxxxxxxxx011111   48-bit

     ···xxxx xxxxxxxxxxxxxxxx xxxxxxxxx0111111   64-bit

     ···xxxx xxxxxxxxxxxxxxxx xnnnxxxxx1111111   (80+16*nnn)-bit, nnn ≠ 111

     ···xxxx xxxxxxxxxxxxxxxx x111xxxxx1111111   Reserved for ≥192-bits

Byte Address:  base+4           base+2           base

Figure 1.1: RISC-V instruction length encoding.
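The length encoding in Figure 1.1 can be read as a short decision procedure on the first 16-bit parcel of an instruction. The following Python sketch is one way to express it; the function name `instruction_length_bits` and the choice to return `None` for the reserved encoding are assumptions made for illustration.

```python
def instruction_length_bits(parcel):
    """Return the instruction length in bits from its first 16-bit parcel.

    Returns None for the encoding reserved for lengths of 192 bits or more.
    """
    parcel &= 0xFFFF
    if parcel & 0b11 != 0b11:                 # aa != 11
        return 16
    if (parcel >> 2) & 0b111 != 0b111:        # bbb != 111
        return 32
    if parcel & 0b111111 == 0b011111:
        return 48
    if parcel & 0b1111111 == 0b0111111:
        return 64
    nnn = (parcel >> 12) & 0b111              # low 7 bits are 1111111 here
    if nnn != 0b111:
        return 80 + 16 * nnn
    return None
```

Only the low-order bits of the first parcel are examined, so a fetch unit can determine the instruction length before the rest of the instruction has arrived.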

// Store 32-bit instruction in x2 register to location pointed to by x3.

sh x2, 0(x3) // Store low bits of instruction in first parcel.

srli x2, x2, 16 // Move high bits down to low bits, overwriting x2.

sh x2, 2(x3) // Store high bits in second parcel.

Figure 1.2: Recommended code sequence to store 32-bit instruction from register to memory. Operates correctly on both big- and little-endian memory systems and avoids misaligned accesses when used with variable-length instruction-set extensions.
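The effect of the two-halfword store sequence can be mimicked in Python to see why it is endian-neutral: each 16-bit parcel is written in the memory system's own byte order, and the low parcel always goes to the lower address. The helpers `store_instruction` and `fetch_parcel` are hypothetical names for this sketch, which models memory as a `bytearray`.

```python
def store_instruction(mem, addr, instr, byteorder):
    """Store a 32-bit instruction as two 16-bit parcels, low parcel first."""
    low = instr & 0xFFFF              # sh x2, 0(x3)
    high = (instr >> 16) & 0xFFFF     # srli x2, x2, 16
    mem[addr:addr + 2] = low.to_bytes(2, byteorder)
    mem[addr + 2:addr + 4] = high.to_bytes(2, byteorder)   # sh x2, 2(x3)

def fetch_parcel(mem, addr, byteorder):
    """Read one 16-bit parcel using the same byte order as the stores."""
    return int.from_bytes(mem[addr:addr + 2], byteorder)
```

With either `"little"` or `"big"` as the byte order, the parcel at the base address holds the low 16 bits of the instruction, which are the bits that carry the length encoding.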

“C”: Compressed Instruction Extension

▪ Compressed code important for:
  - low-end embedded to save static code space
  - high-end commercial workloads to reduce cache footprint

▪ C extension adds 16-bit compressed instructions
  - 2-address forms with all 32 registers
  - 2/3-address forms with most frequent 8 registers

▪ 1 compressed instruction expands to 1 base instruction

▪ Assembly lang. programmer & compiler oblivious

▪ RVC ⇒ RVI decoder only ~700 gates (~2% of small core)

▪ All original 32-bit instructions retain encoding but now can be 16-bit aligned

§ 50%-60% instructions compress ⇒ 25%-30% smaller
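The link between the fraction of instructions that compress and the code-size saving is simple arithmetic: each compressed instruction shrinks from 4 bytes to 2. A small Python sketch (the function name `compressed_size_ratio` is an assumption for illustration) makes the relationship explicit.

```python
def compressed_size_ratio(fraction_compressed, full_bytes=4, short_bytes=2):
    """Code size relative to an all-32-bit encoding.

    fraction_compressed: share of instructions that have a 16-bit form.
    """
    mixed = (fraction_compressed * short_bytes
             + (1 - fraction_compressed) * full_bytes)
    return mixed / full_bytes
```

With half of the instructions compressed the ratio is 0.75 (25% smaller), and with 60% compressed it is 0.70 (30% smaller), matching the range quoted on the slide.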

[Figure: SPECint2006 compressed code size with save/restore optimization (relative to “standard” RVC), in two panels, 64-bit Address and 32-bit Address. 64-bit bars: RV64C 100%, RV64 141%, X86-64 131%, ARMv8 129%, MIPS64 169%.]

§ RISC-V now smallest ISA for 32- and 64-bit addresses

§ All results with same GCC compiler and options

Proposed “V” Vector Extension State

§ Standard RISC-V scalar x and f registers (x0-x31, f0-f31)

§ Up to 32 vector data registers, v0-v31, of at least 4 elements each, with variable bits/element (8, 16, 32, 64, 128)

§ 8 vector predicate registers (p0-p7), with 1 bit per element

§ Vector configuration CSR (vcfg)

§ Vector length CSR (vlr)

§ MVL is maximum vector length, implementation and configuration dependent, but MVL >= 4

RISC-V Privileged Architecture

§ Three privilege modes
  - User (U-mode)
  - Supervisor (S-mode)
  - Machine (M-mode)

§ Supported combinations of modes:
  - M (simple embedded systems)
  - M, U (embedded systems with protection)
  - M, S, U (systems running Unix-style operating systems)

§ Hypervisors to be run in modified S mode
  - Prioritize support for Type-2 Hypervisors like KVM
  - Can also support Type-1 Hypervisors in same model

Simple Embedded Systems (M-mode only)

§ No address translation/protection
  - “Mbare” bare-metal mode
  - Trap bad physical addresses precisely

§ All code inherently trusted

§ Low implementation cost
  - 2^7 bits of architectural state (in addition to user ISA)
  - +2^7 more bits for timers
  - +2^7 more for basic performance counters

Secure Embedded Systems (M, U modes)

§ M-mode runs secure boot and runtime monitor

§ Embedded code runs in U-mode

§ Physical memory protection (PMP) on U-mode accesses

§ Interrupt handling can be delegated to U-mode code
  - User-level interrupt support

§ Provides arbitrary number of isolated subsystems

[Figure: M-mode monitor, U-mode process 1, U-mode process 2, PMP; Device 1 Interrupts, Device 2 Interrupts, Other Interrupts.]

Virtual Memory Architectures (M, S, U modes)

§ Designed to support current Unix-style operating systems

§ Sv32 (RV32)
  - Demand-paged 32-bit virtual-address spaces
  - 2-level page table
  - 4 KiB pages, 4 MiB megapages

§ Sv39 (RV64)
  - Demand-paged 39-bit virtual-address spaces
  - 3-level page table
  - 4 KiB pages, 2 MiB megapages, 1 GiB gigapages

§ Sv48, Sv57, Sv64 (RV64)
  - Sv39 + 1/2/3 more page-table levels
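The Sv39 numbers fit together as 12 offset bits plus three 9-bit page-table indices (12 + 3 × 9 = 39), which is also why the page sizes step from 4 KiB to 2 MiB to 1 GiB. The Python sketch below splits a virtual address along those lines; the 9-bit index width is derived from the slide's figures, and the names `split_sv39` and `page_size` are hypothetical.

```python
PAGE_OFFSET_BITS = 12   # 4 KiB pages
VPN_BITS = 9            # index bits per page-table level
LEVELS = 3              # Sv39 uses a 3-level page table

def split_sv39(va):
    """Split a 39-bit virtual address into (vpn2, vpn1, vpn0, offset)."""
    offset = va & ((1 << PAGE_OFFSET_BITS) - 1)
    vpn = [(va >> (PAGE_OFFSET_BITS + VPN_BITS * i)) & ((1 << VPN_BITS) - 1)
           for i in range(LEVELS)]
    return vpn[2], vpn[1], vpn[0], offset

def page_size(level):
    """Bytes mapped by a leaf entry at the given level (0 = last level)."""
    return 1 << (PAGE_OFFSET_BITS + VPN_BITS * level)
```

A leaf at level 0 maps a 4 KiB page, at level 1 a 2 MiB megapage, and at level 2 a 1 GiB gigapage. Each extra level in Sv48 and beyond would add another 9-bit index under the same scheme.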

S-Mode runs on top of M-mode

§ M-mode runs secure boot and monitor

§ S-mode runs OS

§ U-mode runs application on top of OS or M-mode

[Figure: M-mode monitor, S-mode OS, U-mode app, U-mode system process, PMP; Device 1 Interrupts, Device 2 Interrupts, Other Interrupts.]

RISC-V Virtualization Stacks

§ Provide clean split between layers of the software stack

§ Application communicates with Application Execution Environment (AEE) via Application Binary Interface (ABI)

§ OS communicates with Supervisor Execution Environment (SEE) via Supervisor Binary Interface (SBI)

§ Hypervisor communicates via Hypervisor Binary Interface to Hypervisor Execution Environment

§ All levels of ISA designed to support virtualization

Volume II: RISC-V Privileged Architectures, 1.1 draft

[Figure: three implementation stacks built from Application, ABI, AEE, OS, SBI, SEE, Hypervisor, HBI, and HEE layers.]

Figure 1.1: Different implementation stacks supporting various forms of privileged execution.

the OS, which provides the AEE. Just as applications interface with an AEE via an ABI, RISC-V operating systems interface with a supervisor execution environment (SEE) via a supervisor binary interface (SBI). An SBI comprises the user-level and supervisor-level ISA together with a set of SBI function calls. Using a single SBI across all SEE implementations allows a single OS binary image to run on any SEE. The SEE can be a simple boot loader and BIOS-style IO system in a low-end hardware platform, or a hypervisor-provided virtual machine in a high-end server, or a thin translation layer over a host operating system in an architecture simulation environment.

The rightmost configuration shows a virtual machine monitor configuration where multiple multiprogrammed OSs are supported by a single hypervisor. Each OS communicates via an SBI with the hypervisor, which provides the SEE. The hypervisor communicates with the hypervisor execution environment (HEE) using a hypervisor binary interface, to isolate the hypervisor from details of the hardware platform.

Our graphical convention represents abstract interfaces using black boxes with white text, to separate them from actual components.

The various ABI, SBI, and HBIs are still a work-in-progress, but we anticipate the SBI and HBI to support devices via virtualized device interfaces similar to virtio [2], and to support device discovery. In this manner, only one set of device drivers need be written that can support any OS or hypervisor, and which can also be shared with the boot environment.

Hardware implementations of the RISC-V ISA will generally require additional features beyond the privileged ISA to support the various execution environments (AEE, SEE, or HEE), but these we consider separately as part of a hardware abstraction layer (HAL), as shown in Figure 1.2. Note

[Figure: the same three implementation stacks, each with a HAL layer between the execution environment (AEE, SEE, or HEE) and the Hardware.]

Figure 1.2: Hardware abstraction layers (HALs) abstract underlying hardware platforms from the execution environments.

Supervisor Binary Interface

§ Platform-specific functionality abstracted behind SBI
  - Query physical memory map
  - Get device info
  - Get hardware thread ID and # of hardware threads
  - Save/restore coprocessor state
  - Query timer properties, set up timer interrupts
  - Send interprocessor interrupts
  - Send TLB shootdowns
  - Reboot/shutdown

§ Simplifies hardware acceleration

§ Simplifies virtualization

RISC-V “Green Card”

[Figure: RISC-V Reference Card for RV32I / RV64I / RV128I + M, A, F, D, Q, C. Panels: ① Base Integer Instructions (32|64|128); ② RV Privileged Instructions (32|64|128); ③ Optional FP Extensions: RV32{F|D|Q} and RV{64|128}{F|D|Q}; ④ Optional Compressed Instructions: RVC; Optional Multiply-Divide Extension: RV32M; Optional Atomic Instruction Extension: RVA; 16-bit (RVC) and 32-bit Instruction Formats. Each panel lists Category, Name, Fmt, and mnemonic.]

Instruction counts highlighted on the card:

+ 14 Privileged
+ 8 for M
+ 11 for A
+ 34 for F, D, Q
+ 46 for C

Additional instructions for the wider bases:

+ 12 for 64I/128I
+ 4 for 64M/128M
+ 11 for 64A/128A
+ 6 for 64{F|D|Q}/128{F|D|Q}

Simplicity breeds Contempt

§ How can simple ISA compete with industry monsters?

§ How do we measure ISA quality?
  - Static code bytes for program
  - Dynamic code bytes fetched for execution
  - Microarchitectural work generated for execution

Dynamic Bytes Fetched

Fig. 1: The total dynamic instruction count is shown for each of the ISAs, normalized to the x86-64 instruction count. The x86-64 retired micro-op count is also shown to provide a comparison between x86-64 instructions and the actual operations required to execute said instructions. By leveraging macro-op fusion (in which some common multi-instruction idioms are combined into a single operation), the “effective” instruction count for RV64GC can be reduced by 5.4%.

Fig. 2: Total dynamic bytes normalized to x86-64. RV64G, ARMv7, and ARMv8 use fixed 4 byte instructions. x86-64 is a variable-length ISA and for SPECInt averages 3.71 bytes / instruction. RV64GC uses two byte forms of the most common instructions allowing it to average 3.00 bytes / instruction.

compiler analysis cannot guarantee that the high-order bits are not zero.

403.gcc: 30% of the RISC-V instruction count is taken up by a memset loop. x86-64 utilizes a movdqa instruction (aligned double quad-word move, i.e., a 128-bit store) and a four-way unrolled loop to move 64 bytes in 7 instructions versus RV64G’s 4 instructions to move 16 bytes.

464.h264ref: 25% of the RISC-V instruction count is taken up by a memcpy loop. Those 21 RV64G instructions together account for 1.1 trillion fetches, compared to a single x86-64 “repeat move” instruction that is executed 450 billion times.

Remaining benchmarks

Consistent themes of the remaining benchmarks are as follows:

• RISC-V’s fused compare-and-branch instruction allows it to execute typical loops using one less instruction compared to the ARM and x86 ISAs, both of which separate out the comparison and the jump-on-condition into two distinct instructions.

• Indexed loads are an incredibly common idiom. Although x86-64 and ARM implement indexed loads (register+register addressing mode) as a single instruction, RISC-V requires up to three instructions to emulate the same behavior.

In summary, when RISC-V is using fewer instructions relative to other ISAs, the code likely contains a significant number of branches. When RISC-V is using more instructions, it is often due to a significant number of indexed memory

• RV64GC is lowest overall in dynamic bytes fetched

• Despite current lack of support for vector operations


Converting Instructions to Microops

[Figure (UC Berkeley, “Micro-ops and Macro-op Fusion”): instructions (ISA) versus micro-ops (µarch). Micro-ops generation: rep movs; ld ..., st ..., add ... Macro-op fusion: cmp, bne; jne.]

Multiple microinstructions from one macroinstruction

Or one microinstruction from multiple macroinstructions

Microops are measure of microarchitectural work performed

RISC-V Macro-Op Fusion Examples

§ “Load effective address LEA” &(array[offset])
    slli rd, rs1, {1,2,3}
    add rd, rd, rs2

§ “indexed load” M[rs1+rs2]
    add rd, rs1, rs2
    ld rd, 0(rd)

§ “clear upper word” // rd = rs1 & 0xffff_ffff
    slli rd, rs1, 32
    srli rd, rd, 32

§ Can all be fused simply in decode stage
  - Many are expressible with 2-byte compressed instructions, so effectively just adds new 4-byte instructions

§ RISC-V approach: prefer macro-op fusion to larger ISA
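A decode-stage fusion check for the three idioms above amounts to pattern-matching adjacent instruction pairs. The Python sketch below is a toy model under stated assumptions: instructions are represented by a hypothetical `Instr` tuple, only the exact operand orders shown on the slide are matched, and the fused operation names (`lea`, `indexed_load`, `clear_upper_word`) are invented labels rather than real micro-op names.

```python
from typing import NamedTuple, Optional

class Instr(NamedTuple):
    op: str        # mnemonic, e.g. "slli"
    rd: str        # destination register
    src1: str      # first source (base register for a load)
    src2: object   # second source register, shift amount, or load offset

def fuse_pair(a: Instr, b: Instr) -> Optional[str]:
    """Return a fused-op label if (a, b) matches one of the slide's idioms."""
    # The second instruction must consume and overwrite the first's result.
    if not (a.rd == b.rd and b.src1 == a.rd):
        return None
    if a.op == "slli" and b.op == "add" and a.src2 in (1, 2, 3):
        return "lea"
    if a.op == "add" and b.op == "ld" and b.src2 == 0:
        return "indexed_load"
    if a.op == "slli" and b.op == "srli" and a.src2 == 32 and b.src2 == 32:
        return "clear_upper_word"
    return None

def fuse(stream):
    """Greedily fuse adjacent pairs; unfused instructions pass through."""
    out, i = [], 0
    while i < len(stream):
        label = fuse_pair(stream[i], stream[i + 1]) if i + 1 < len(stream) else None
        if label:
            out.append(label)
            i += 2
        else:
            out.append(stream[i].op)
            i += 1
    return out
```

Requiring the second instruction to overwrite the first's destination is what makes fusion safe here: the intermediate value is dead after the pair, so the fused operation needs only one register write.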

RISC-V Competitive µarch Effort after Fusion

TABLE V: ARMv8 memory instruction counts. Data is shown for normal loads (ld), loads with increment addressing (ldia), load-pairs (ldp), and load-pairs with increment addressing (ldpia). Data is also shown for the corresponding stores. Many of these instructions are likely candidates to be broken up into micro-op sequences when executed on a processor pipeline. For example, ldia and ldp require two write ports and the ldpia instruction requires three register write ports.

                 % of total ARMv8 instruction count
benchmark        ld     ldia  ldp   ldpia  st     stia  stp   stpia
400.perlbench    18.18  0.06  3.87  1.02   6.14   1.02  3.81  1.02
401.bzip2        22.85  1.71  0.53  0.02   8.28   0.02  0.24  0.02
403.gcc          16.80  0.11  2.89  1.04   3.32   1.04  3.03  1.04
429.mcf          26.61  0.01  3.21  0.07   3.76   0.07  3.22  0.07
445.gobmk        15.77  1.01  2.04  0.77   6.14   0.74  2.19  0.74
456.hmmer        24.20  0.09  0.06  0.02   13.75  0.02  0.01  0.02
458.sjeng        17.37  0.00  1.30  0.26   4.38   0.26  1.46  0.26
462.libquantum   14.00  0.00  0.15  0.06   1.85   0.06  0.31  0.06
464.h264ref      28.36  0.01  6.61  1.85   3.18   1.82  5.91  1.82
471.omnetpp      19.16  0.45  2.56  1.55   8.43   1.54  3.11  1.54
473.astar        24.08  0.01  0.84  0.15   3.73   0.15  0.83  0.15
483.xalancbmk    20.94  4.84  1.82  0.68   1.74   0.67  1.51  0.67
arithmetic mean  20.69  0.69  2.16  0.62   5.39   0.62  2.14  0.62

Fig. 4: The geometric mean of the instruction counts of the twelve SPECInt benchmarks is shown for each of the ISAs, normalized to x86-64. The x86-64 micro-op count is reported from the micro-architectural counters on an Intel Xeon processor. The RV64GC macro-op count was collected as described in Section VI-B. The ARMv8 micro-op count was synthetically created by breaking up load-increment-address, load-pair, and load-pair-increment-address into multiple micro-ops.

micro-ops. In particular, ARMv8 supports memory operations with increment addressing modes and load-pair/store-pair instructions. Two write-ports are required for the load-pair instruction (ldp) and for loads with increment addressing (ldia), while three write-ports are required for load-pair with increment addressing (ldpia). We then modified the QEMU ARMv8 ISA simulator to count these instructions that are likely candidates for generating multiple micro-ops.

Although we show the breakdown of all load and store instructions in Table V, we assume for Figure 1 that only ldia, ldp, and ldpia increase the micro-op count for our hypothetical ARMv8 processor. Cracking these instructions into multiple micro-ops leads to an average increase of 4.09% in the operation count for ARMv8. As a comparison, the Cortex-A72 out-of-order processor is reported to emit “less than 1.1 micro-ops” per instruction and breaks down “move-and-branch” and “load/store-multiple” into multiple micro-ops. [6]

We note that it is possible to “brute-force” these ARMv8 instructions and handle them as a single operation within the processor backend. Many ARMv8 integer instructions require three read ports, so it is likely that most (if not all) ARMv8 cores will pay the area overhead of a third read port for the complex store instructions. Likewise, they can pay the cost to add a second (or even third) write port to natively support the load-pair and increment addressing modes. Of course, there is nothing that prevents a RISC-V core from taking on this complexity, adding the additional register ports, and using macro-op fusion to emulate the same complex idioms that ARMv8 has chosen to declare at the ISA level.

VIII. RECOMMENDATIONS

A number of lessons can be learned from analyzing RISC-V’s performance on SPECInt.

A. Programmers

Although it is not legal to modify SPEC for benchmarking, an analysis of its hot loops highlights a few coding idioms that can hurt performance on RISC-V (and often other) platforms.

• Avoid unsigned 32-bit integers for array indices. The size_t type should be used for array indexing and loop counting.

• Avoid multi-dimensional arrays if the sizes are known and fixed. Each additional dimension in the array is an extra level of indirection in C, which is another load from memory.

• C standard aliasing rules can prevent the compiler from making optimizations that are otherwise “obvious” to the programmer. For example, you may need to manually ‘lift’ code out of a loop that returns the same value every iteration.

[Details in UCB 2016 TR and 4th RISC-V workshop talk by Chris Celio]

UC Berkeley RISC-V Core Generators

§ Rocket: Family of In-order Cores
  - Supports 32-bit and 64-bit single-issue only
  - Dual-issue soon
  - Similar in spirit to ARM Cortex M-series and A5/A7/A53

§ BOOM: Family of Out-of-Order Cores
  - Supports 64-bit single-, dual-, quad-issue
  - Similar in spirit to ARM Cortex A9/A15/A57

RISC-V Rocket In-Order Core

- 64-bit 5-stage single-issue in-order pipeline
- Design minimizes impact of long clock-to-output delays of compiler-generated RAMs
- 64-entry BTB, 256-entry BHT, 2-entry RAS
- MMU supports page-based virtual memory
- IEEE 754-2008-compliant FPU
  - Supports SP, DP fused multiply-adds with hardware support for subnormals
- Currently working on dual-issue in-order Rocket

[Figure: Rocket pipeline (PC, IF, ID, EX, MEM, WB) with stages PC Gen., I$ Access, ITLB, Inst. Decode, Int.RF, Int.EX, D$ Access, DTLB, Commit, plus FP.RF, FP.EX1, FP.EX2, FP.EX3, and a path to the RoCC coprocessor. Hwacha pipeline, fed from Rocket: PC Gen., VI$ Access, VITLB, VInst. Decode, Sequencer, Expand, Bank1 R/W ... Bank8 R/W. Bypass paths omitted for simplicity.]

ARM Cortex-A5 vs. RISC-V Rocket

Category              ARM Cortex-A5          RISC-V Rocket
ISA                   32-bit ARM v7          64-bit RISC-V v2
Architecture          Single-Issue In-Order  Single-Issue In-Order 5-stage
Performance           1.57 DMIPS/MHz         1.72 DMIPS/MHz
Process               TSMC 40GPLUS           TSMC 40GPLUS
Area w/o Caches       0.27 mm²               0.14 mm²
Area with 16K Caches  0.53 mm²               0.39 mm²
Area Efficiency       2.96 DMIPS/MHz/mm²     4.41 DMIPS/MHz/mm²
Frequency             >1 GHz                 >1 GHz
Dynamic Power         <0.08 mW/MHz           0.034 mW/MHz

- PPA reporting conditions
  - 85% utilization, use Dhrystone for benchmark, frequency/power at TT 0.9V 25C, all regular VT transistors

- 10% higher in DMIPS/MHz, 49% more area-efficient
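The summary figures follow directly from the table rows: area efficiency is performance divided by the area with caches, and the percentages are ratios of the Rocket and Cortex-A5 values. A short Python check (the function and variable names are assumptions for this sketch) reproduces them.

```python
def area_efficiency(dmips_per_mhz, area_mm2):
    """Performance per unit area, in DMIPS/MHz/mm^2."""
    return dmips_per_mhz / area_mm2

# Values taken from the comparison table (area with 16K caches).
a5_eff = area_efficiency(1.57, 0.53)
rocket_eff = area_efficiency(1.72, 0.39)

# Relative advantages of Rocket over the Cortex-A5, as percentages.
perf_gain_pct = (1.72 / 1.57 - 1) * 100
eff_gain_pct = (rocket_eff / a5_eff - 1) * 100
```

Rounded to two places, the efficiencies come out to 2.96 and 4.41 DMIPS/MHz/mm², and the gains round to 10% and 49%, in line with the slide.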

RISC-V BOOM Out-of-Order Core

§ Baseline Design
- Superscalar (parameterizable widths)
- Full branch speculation (BTB/BHT/RAS)
- Load/store queue with store ordering
  - Loads execute fully OoO wrt stores, other loads
  - Store-data forwards to loads

§ Estimated 2 GHz+ in 45 nm (<30 FO4)

ARM Cortex-A9 vs. RISC-V BOOM

Category | ARM Cortex-A9 | RISC-V BOOM-2w
ISA | 32-bit ARMv7 | 64-bit RISC-V v2 (RV64G)
Architecture | 2 wide, 3+1 issue, Out-of-Order, 8-stage | 2 wide, 3 issue, Out-of-Order, 6-stage
Performance | 3.59 CoreMarks/MHz | 3.91 CoreMarks/MHz
Process | TSMC 40GPLUS | TSMC 40GPLUS
Area with 32K caches | 2.5 mm² | 1.00 mm²
Area efficiency | 1.4 CoreMarks/MHz/mm² | 3.9 CoreMarks/MHz/mm²
Frequency | 1.4 GHz | 1.5 GHz

Caveats: A9 includes NEON; BOOM is 64-bit, has IEEE-2008 fused mul-add

CoreMark Scores

[Figure: bar chart of CoreMark/MHz (axis 0 to 6.00) comparing UC Berkeley cores (BOOM-4w, BOOM-2w, Rocket) with industry processors (Ivy Bridge, Cortex-A15, Cortex-A9, Cortex-A8, Cortex-A5, MIPS 74k), grouped into in-order and out-of-order processors.]

See Chris Celio’s “BOOM: Berkeley Out-of-Order Machine” talk from 2nd RISC-V Workshop for more details

Rocket Chip Generator

[Figure: Rocket Chip SoC block diagram. Blocks shown: Rocket tiles (Rocket core, L1 I$, L1 D$, RoCC accelerator, CSR file), JTAG debug, L1 network, L2$ banks, cache-coherent devices, L2 network, TileLink/AXI4 bridges, AXI4 crossbar, DRAM controllers, high-speed IO devices, AXI4/AHB bridge, AHB-Lite bus, Z-scale, low-speed IO device, scratchpad, SCR file, AHB/APB bridge, APB bus, peripherals.]

1. Change Parameters
2. Develop New Accelerators
3. Develop Own RISC-V Core
4. Develop Own Device

UC Berkeley RISC-V Cores: Six 28nm & Six 45nm RISC-V Chip Tapeouts So Far

[Figure: tapeout timeline, 2011 to 2016, showing Raven-1 through Raven-4, EOS14 through EOS24, SWERVE, Hurricane-1 and Hurricane-2, with 1 GHz and 1.65 GHz annotations.]

EOS: IBM 45nm SOI
Raven: ST 28nm FDSOI
Hurricane: ST 28nm FDSOI
SWERVE: TSMC 28nm

All based on Rocket in-order core

• Open-Source RTL
• Arduino-Compatible
• Freedom E SDK
• Arduino IDE Environment
• Available for sale now!
• $59

https://www.crowdsupply.com/sifive/hifive1

RISC-V is GREAT at Perf and Power

Microcontroller | CPU Core | CPU ISA | CPU Speed | DMIPs/MHz | Total Dhrystones | DMIPs/mW
Intel Curie Module | Intel Quark SE | x86 | 32 MHz | 1.3 | 41.6 | 0.35
ATmega328P | AVR | AVR (8-bit) | 16 MHz | 0.30 | 5 | 0.10
ATSAMD21G18 | ARM Cortex M0+ | ARMv6-M | 48 MHz | 0.93 | 44.64 | (not listed)
Nordic NRF51 | ARM Cortex M0+ | ARMv6-M | 16 MHz | 0.93 | 14.88 | 1.88
Freedom E310 | SiFive E31 | RISC-V RV32IMAC | 200 MHz, 320 MHz (max) | 1.61 | 320.39 | 3.16

© 2016 SiFive. All Rights Reserved.

• 10x Faster Clock than Intel’s Arduino 101 uController
• 11x More Dhrystones than ARM’s Arduino Zero (ATSAMD21G18)
• 9x More Power Efficient than Intel Quark
• 2x More Power Efficient than ARM Cortex M0+
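
The ratios in these bullets can be recomputed from the table. The sketch below rests on two assumptions that the slide does not spell out. First, the clock and Dhrystone comparisons use the Freedom E310's 320 MHz maximum, with total Dhrystones taken as DMIPs/MHz times clock. Second, the Cortex M0+ power comparison uses the Nordic NRF51 row, the only M0+ row with a DMIPs/mW value. All names are illustrative.

```python
def total_dmips(dmips_per_mhz, clock_mhz):
    """Total Dhrystone throughput as DMIPs/MHz times clock in MHz."""
    return dmips_per_mhz * clock_mhz


# Values copied from the table.
quark = {"clock_mhz": 32, "dmips_per_mhz": 1.3, "dmips_per_mw": 0.35}
samd21 = {"clock_mhz": 48, "dmips_per_mhz": 0.93}
nrf51 = {"clock_mhz": 16, "dmips_per_mhz": 0.93, "dmips_per_mw": 1.88}
# Assumption: use the 320 MHz maximum clock for the E310.
e310 = {"clock_mhz": 320, "dmips_per_mhz": 1.61, "dmips_per_mw": 3.16}

clock_ratio = e310["clock_mhz"] / quark["clock_mhz"]
dhrystone_ratio = (
    total_dmips(e310["dmips_per_mhz"], e310["clock_mhz"])
    / total_dmips(samd21["dmips_per_mhz"], samd21["clock_mhz"])
)
power_vs_quark = e310["dmips_per_mw"] / quark["dmips_per_mw"]
power_vs_m0plus = e310["dmips_per_mw"] / nrf51["dmips_per_mw"]

if __name__ == "__main__":
    print(f"Clock vs Quark:       {clock_ratio:.1f}x")
    print(f"Dhrystones vs SAMD21: {dhrystone_ratio:.1f}x")
    print(f"DMIPs/mW vs Quark:    {power_vs_quark:.1f}x")
    print(f"DMIPs/mW vs M0+:      {power_vs_m0plus:.1f}x")
```

Under these assumptions the first three ratios come out close to the quoted 10x, 11x and 9x, and the last is about 1.7x, which the slide rounds to 2x.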

RISC-V Outside Berkeley

§ Adopted as “standard ISA” for India
- IIT-Madras $90M funding to build 6 different open-source RISC-V cores, from microcontrollers to servers
- C-DAC $45M funding to build 2 GHz quad-core

§ NVIDIA selected RISC-V for on-chip microcontrollers

§ LowRISC project based in Cambridge, UK producing open-source RISC-V Rocket-based SoCs
- Led by Raspberry Pi co-founder, privately funded

§ Many companies developing RISC-V cores for use in tightly integrated hardware/software IP blocks

§ First commercial RISC-V cores have already shipped!
- Rumble Development Corp, for dental camera/imaging

§ Multiple commercial silicon implementations should be for sale later this year

RISC-V Foundation

§ Mission statement
- “to standardize, protect, and promote the free and open RISC-V instruction set architecture and its hardware and software ecosystem for use in all computing devices.”

§ Established as a 501(c)(6) non-profit corporation on August 3, 2015

§ Rick O’Connor recruited as Executive Director

§ First year, 41+ “founding” members. Additional members welcome

RISC-V in Education, Patterson/Hennessy books

Available Now!

Graduate book 6th edition will also use RISC-V

Why use a free and open ISA like RISC-V?

Would you like to: | Or are you happy with:
Pick ISA then pick vendor | Pick vendor, use their ISA
Get competitive bid for 2nd gen. core | Vendor lock in
Enable open software stack | Binary blobs / software NDAs
Build your own core configuration | Buying a vendor configuration
Sharing your core designs | No community for core designs
Share spec and verification | Trusting vendors’ verification
Teach class with real core design | Using crippled cores in teaching
Resell IP with controller core inside | Ask customers to license controller
Be assured support for 50+ years | Trusting vendor business decisions

Modest RISC-V Project Goal

Become the industry-standard ISA for all computing devices

RISC-V Research Project Sponsors

▪ DoE Isis Project
▪ DARPA PERFECT program
▪ DARPA POEM program (Si photonics)
▪ STARnet Center for Future Architectures (C-FAR)
▪ Lawrence Berkeley National Laboratory
▪ Industrial sponsors (ParLab + ASPIRE)
- Intel, Google, HPE, Huawei, LG, NEC, Microsoft, Nokia, NVIDIA, Oracle, Samsung

Questions?

Backup
