Instruc(on Sets Want to be Free! - RISC-V
Transcript of Instruc(on Sets Want to be Free! - RISC-V
Instruc(onSetsWanttobeFree!KrsteAsanovic
UCBerkeley,RISC-VFounda(on,&[email protected]
www.riscv.org
ShanghaiJiaotongUniversity,May8,2017
WhyInstruc,onSetArchitecturema4ers
§ Whycan’tIntelsellmobilechips?- 99%+ofmobilephones/tabletsbasedonARMv7/v8ISA
§ Whycan’tARMpartnerssellservers?- 99%+oflaptops/desktops/serversbasedonAMD64ISA(over95%+builtbyIntel)
§ HowcanIBMs,llsellmainframes?- IBM360,oldestsurvivingISA(50+years)
ISAismostimportantinterfaceincomputersystemwhereso(waremeetshardware
OpenSoBware/StandardsWork!
§ Whynotsuccessfulfree&openstandardsandfree&openimplementa(ons,likeotherfields
§ DominantproprietaryISAsarenotgreatdesigns3
Field Standard Free,OpenImpl. ProprietaryImpl.Networking Ethernet,
TCP/IPMany Many
OS Posix Linux,FreeBSD M/SWindowsCompilers C gcc,LLVM Intelicc,ARMccDatabases SQL MySQL,
PostgresSQLOracle12C,M/SDB2
Graphics OpenGL Mesa3D M/SDirectXISA ?????? ----------- x86,ARM,IBM360
WhatisRISC-V?
§ Fighgenera(onofRISCdesignfromUCBerkeley§ Ahigh-quality,license-free,royalty-freeRISCISAspecifica(on
§ Experiencingrapiduptakeinbothindustryandacademia
§ Standardmaintainedbynon-profitRISC-VFounda(on§ Bothproprietaryandopen-sourcecoreimplementa(ons
§ Supportedbygrowingsharedsogwareecosystem§ Appropriateforalllevelsofcompu(ngsystem,frommicrocontrollerstosupercomputers
RISC-VOrigins
§ In2010,agermanyyearsandmanyprojectsusingMIPS,SPARC,andx86asbasisofresearchatBerkeley,(metochooseISAfornextsetofprojects
§ Obviouschoices:x86andARM
Intelx86“AAA”Instruc,on
§ ASCIIAdjustAgerAddi(on§ ALregisterisdefaultsourceanddes(na(on
§ Ifthelownibbleis>9decimal,ortheauxiliarycarryflagAF=1,then- Add6tolownibbleofALanddiscardoverflow- IncrementhighbyteofAL- SetCFandAF
§ Else- CF=AF=0
§ Singlebyteinstruc(on
6
ARMv7LDMIAEQInstruc,on
LDMIAEQ SP!, {R4-R7, PC}
§ LoaDMul(ple,Increment-Address§ Writesto7registersfrom6loads§ OnlyexecutesifEQcondi(oncodeisset§ WritestothePC(acondi(onalbranch)§ Canchangeinstruc(onsets
§ Idiomfor"stackpopandreturnfromafunc(oncall"
7
RISC-VOriginStory
§ x86impossible–IPissues,toocomplex§ ARMmostlyimpossible–no64-bit,IPissues,complex§ Sowestarted“3-monthproject”insummer2010todevelopourownclean-slateISA- AndrewWaterman,YunsupLee,DavePaserson,KrsteAsanovicprincipaldesigners
§ Fouryearslater,wereleasedfrozenbaseuserspec- Firstpublicspecifica(onreleasedinMay2011- Manytapeoutsandseveralpublica(onsalongtheway
WhyareoutsiderscomplainingaboutchangestoRISC-VinBerkeleyclasses?
8
WhyisnameRISC-V?(pronounced“risk-five”)
9
RISC-I
SOAR(akaRISC-III)
SPUR(akaRISC-IV)
RISC-V(Raven-1,28nmFDSOI,2011)
UniversalISARequirements
§ Workswellwithexis(ngsogwarestacks,languages§ Isna(vehardwareISA,notvirtualmachine/ANDF§ Suitsallsizesofprocessor,fromsmallestmicrocontrollertolargestsupercomputer
§ Suitsallimplementa(ontechnologies,FPGA,ASIC,full-custom,futuredevicetechnologies…
§ Efficientforallmicroarchitecturestyles:microcoded,in-order,decoupled,out-of-order,single-issue,superscalar,…
§ Supportsextensivespecializa(ontoactasbaseforcustomizedaccelerators
§ Stable:notchanging,notdisappearing
10
WhyDidn’tOtherOpenISAsTakeOff?
▪ SPARCV8-Toitscredit,SunMicrosystemsmadeSPARCV8anIEEEstandardin1994
- Sun,Gaislerofferedopen-sourcecores
- ISAnowownedbyOracle▪ OpenRISC-GNUopen-sourceeffortstartedin1999,basedonDLXfromComputerArchitecture:AQA▪ 64-bitISAwasinprogressin2010▪ Didn’tseparateArchitectureandImplementa(on
▪ Compe(nginmicroprocessorera–nowinSoCera▪ Don’tmeettheneedsofauniversalISA
11
What’sDifferentaboutRISC-V?
§ Simple- FarsmallerthanothercommercialISAs
§ Clean-slatedesign- Clearsepara(onbetweenuserandprivilegedISA- Avoidsµarchitectureortechnology-dependentfeatures
§ AmodularISA- SmallstandardbaseISA- Mul(plestandardextensions
§ Designedforextensibility/specializaKon- Variable-lengthinstruc(onencoding- Vastopcodespaceavailableforinstruc(on-setextensions
§ Stable- Baseandstandardextensionsarefrozen- Addi(onsviaop(onalextensions,notnewversions
12
RISC-VBasePlusStandardExtensions§ FourbaseintegerISAs- RV32E,RV32I,RV64I,RV128I- RV32Eis16-registersubsetofRV32I- Only<50hardwareinstruc(onsneededforbase
§ Standardextensions- M:Integermul(ply/divide- A:Atomicmemoryopera(ons(AMOs+LR/SC)- F:Single-precisionfloa(ng-point- D:Double-precisionfloa(ng-point- G=IMAFD,“General-purpose”ISA- Q:Quad-precisionfloa(ng-point
§ AlltheaboveareafairlystandardRISCencodinginafixed32-bitinstruc(onformat
§ Aboveuser-levelISAcomponentsfrozenin2014- Supportedforeverager 13
RISC-VStandardBaseISADetails
▪ 32-bitfixed-width,naturallyalignedinstruc(ons▪ 31integerregistersx1-x31,plusx0zeroregister▪ rd/rs1/rs2infixedloca(on,noimplicitregisters▪ Immediatefield(instr[31])alwayssign-extended▪ Floa(ng-pointaddsf0-f31registersplusFPCSR,alsofusedmul-addfour-registerformat
▪ DesignedtosupportPICanddynamiclinking14
“A”:AtomicOpera,onsExtension
Twoclasses:▪ AtomicMemoryOpera(ons(AMO)
- Fetch-and-op,op=ADD,OR,XOR,MAX,MIN,MAXU,MINU
▪ Load–Reserved/StoreCondi(onal- Withforwardprogressguaranteeforshortsequences
▪ Allatomicopera(onscanbeannotatedwithtwobits(Acquire/Release)toimplementreleaseconsistencyorsequen(alconsistency
▪ Currentissuesinmemorymodelbeingresolved,willbestrongerthanpurerelaxedmodel.
15
Variable-LengthEncoding
§ Extensionscanuseanymul(pleof16bitsasinstruc(onlength
§ Branches/Jumpstarget16-bitboundarieseveninfixed32-bitbase- Consumes1extrabitofjump/branchaddress
16
Copyright © 2010–2014, The Regents of the University of California. All rights reserved. 7
xxxxxxxxxxxxxxaa 16-bit (aa 6= 11)
xxxxxxxxxxxxxxxx xxxxxxxxxxxbbb11 32-bit (bbb 6= 111)
· · ·xxxx xxxxxxxxxxxxxxxx xxxxxxxxxx011111 48-bit
· · ·xxxx xxxxxxxxxxxxxxxx xxxxxxxxx0111111 64-bit
· · ·xxxx xxxxxxxxxxxxxxxx xnnnxxxxx1111111 (80+16*nnn)-bit, nnn 6=111
· · ·xxxx xxxxxxxxxxxxxxxx x111xxxxx1111111 Reserved for �192-bits
Byte Address: base+4 base+2 base
Figure 1.1: RISC-V instruction length encoding.
// Store 32-bit instruction in x2 register to location pointed to by x3.
sh x2, 0(x3) // Store low bits of instruction in first parcel.
srli x2, x2, 16 // Move high bits down to low bits, overwriting x2.
sh x2, 2(x3) // Store high bits in second parcel.
Figure 1.2: Recommended code sequence to store 32-bit instruction from register to memory.Operates correctly on both big- and little-endian memory systems and avoids misaligned accesseswhen used with variable-length instruction-set extensions.
“C”:CompressedInstruc,onExtension
▪ Compressedcodeimportantfor:- low-endembeddedtosavesta(ccodespace
- high-endcommercialworkloadstoreducecachefootprint
▪ Cextensionadds16-bitcompressedinstruc(ons- 2-addressformswithall32registers- 2/3-addressformswithmostfrequent8registers
▪ 1compressedinstruc(onexpandsto1baseinstruc(on▪ Assemblylang.programmer&compileroblivious▪ RVC⇒RVIdecoderonly~700gates(~2%ofsmallcore)
▪ Alloriginal32-bitinstruc(onsretainencodingbutnowcanbe16-bitaligned
§ 50%-60%instruc(onscompress⇒25%-30%smaller17
100%
141% 131% 129%
169%
80%
100%
120%
140%
160%
180%
RV64C RV64 X86-64 ARMv8 MIPS64
64-bit Address
100%
140% 126%
136%
101%
173%
126%
80%
100%
120%
140%
160%
180%
32-bit Address
SPECint2006compressedcodesizewithsave/restoreop,miza,on(rela,veto“standard”RVC)
§ RISC-VnowsmallestISAfor32-and64-bitaddresses
§ AllresultswithsameGCCcompilerandop(ons18
Proposed“V”VectorExtensionState
x0x1
x31
f0f1
f31
StandardRISC-Vscalarxandfregisters
p0[0]p1[0]
p7[0]
p0[1]p1[1]
p7[1]
p0[MVL-1]p1[MVL-1]
p7[MVL-1]
8vectorpredicateregisters,with1bitperelement
vlr
vcfg
Vectorconfigura(onCSR
VectorlengthCSR
v0[0]v1[0]
v31[0]
v0[1]v1[1]
v31[1]
v0[MVL-1]v1[MVL-1]
v31[MVL-1]
Upto32vectordataregisters,v0-v31,ofatleast4elementseach,withvariablebits/element(8,16,32,64,128)
MVLismaximumvectorlength,implementa(onandconfigura(ondependent,butMVL>=4
RISC-VPrivilegedArchitecture
§ Threeprivilegemodes- User(U-mode)- Supervisor(S-mode)- Machine(M-mode)
§ Supportedcombina(onsofmodes:- M (simpleembeddedsystems)- M,U (embeddedsystemswithprotec(on)- M,S,U (systemsrunningUnix-styleopera(ngsystems)
§ HypervisorstoberuninmodifiedSmode- Priori(zesupportforType-2HypervisorslikeKVM- CanalsosupportType-1Hypervisorsinsamemodel
20
SimpleEmbeddedSystems(M-modeonly)
§ Noaddresstransla(on/protec(on- “Mbare”bare-metalmode- Trapbadphysicaladdressesprecisely
§ Allcodeinherentlytrusted
§ Lowimplementa(oncost- 27bitsofarchitecturalstate(inaddi(ontouserISA)- +27morebitsfor(mers- +27moreforbasicperformancecounters
21
SecureEmbeddedSystems(M,Umodes)
§ M-moderunssecurebootandrun(memonitor§ EmbeddedcoderunsinU-mode§ Physicalmemoryprotec(on(PMP)onU-modeaccesses§ InterrupthandlingcanbedelegatedtoU-modecode- User-levelinterruptsupport
§ Providesarbitrarynumberofisolatedsubsystems
22
M-modemonitor
U-modeprocess1
U-modeprocess2
Device2Interrupts
Device1Interrupts
OtherInterrupts
PMP PMP
VirtualMemoryArchitectures(M,S,Umodes)
§ DesignedtosupportcurrentUnix-styleopera(ngsystems
§ Sv32(RV32)- Demand-paged32-bitvirtual-addressspaces- 2-levelpagetable- 4KiBpages,4MiBmegapages
§ Sv39(RV64)- Demand-paged39-bitvirtual-addressspaces- 3-levelpagetable- 4KiBpages,2MiBmegapages,1GiBgigapages
§ Sv48,Sv57,Sv64(RV64)- Sv39+1/2/3morepage-tablelevels
23
S-ModerunsontopofM-mode
§ M-moderunssecurebootandmonitor§ S-moderunsOS§ U-moderunsapplica(onontopofOSorM-mode
24
M-modemonitor
U-modesystemprocess
S-modeOS
Device2Interrupts
Device1Interrupts
OtherInterrupts
U-modeapp
PMP PMP
RISC-VVirtualiza,onStacks
§ Providecleansplitbetweenlayersofthesogwarestack§ Applica(oncommunicateswithApplica(onExecu(onEnvironment(AEE)viaApplica(onBinaryInterface(ABI)
§ OScommunicatesviaSupervisorExecu(onEnvironment(SEE)viaSystemBinaryInterface(SBI)
§ HypervisorcommunicatesviaHypervisorBinaryInterfacetoHypervisorExecu(onEnvironment
§ AlllevelsofISAdesignedtosupportvirtualiza(on25
2 1.1draft: Volume II: RISC-V Privileged Architectures
ApplicationABIAEE
ApplicationABI
OSSBISEE
ApplicationABI
SBIHypervisor
ApplicationABI
OS
ApplicationABI
ApplicationABI
OS
ApplicationABI
SBI
HBIHEE
Figure 1.1: Di↵erent implementation stacks supporting various forms of privileged execution.
the OS, which provides the AEE. Just as applications interface with an AEE via an ABI, RISC-Voperating systems interface with a supervisor execution environment (SEE) via a supervisor binaryinterface (SBI). An SBI comprises the user-level and supervisor-level ISA together with a set ofSBI function calls. Using a single SBI across all SEE implementations allows a single OS binaryimage to run on any SEE. The SEE can be a simple boot loader and BIOS-style IO system in alow-end hardware platform, or a hypervisor-provided virtual machine in a high-end server, or athin translation layer over a host operating system in an architecture simulation environment.
The rightmost configuration shows a virtual machine monitor configuration where multiple multi-programmed OSs are supported by a single hypervisor. Each OS communicates via an SBI with thehypervisor, which provides the SEE. The hypervisor communicates with the hypervisor executionenvironment (HEE) using a hypervisor binary interface, to isolate the hypervisor from details ofthe hardware platform.
Our graphical convention represents abstract interfaces using black boxes with white text, toseparate them from actual components.
The various ABI, SBI, and HBIs are still a work-in-progress, but we anticipate the SBI and HBIto support devices via virtualized device interfaces similar to virtio [2], and to support devicediscovery. In this manner, only one set of device drivers need be written that can support anyOS or hypervisor, and which can also be shared with the boot environment.
Hardware implementations of the RISC-V ISA will generally require additional features beyond theprivileged ISA to support the various execution environments (AEE, SEE, or HEE), but these weconsider separately as part of a hardware abstraction layer (HAL), as shown in Figure 1.2. Note
ApplicationABIAEEHAL
Hardware
ApplicationABI
OSSBISEE
ApplicationABI
HALHardware
SBIHypervisor
ApplicationABI
OS
ApplicationABI
ApplicationABI
OS
ApplicationABI
SBI
HBIHEE
HardwareHAL
Figure 1.2: Hardware abstraction layers (HALs) abstract underlying hardware platforms from theexecution environments.
SupervisorBinaryInterface
§ Pla�orm-specificfunc(onalityabstractedbehindSBI- Queryphysicalmemorymap- Getdeviceinfo- GethardwarethreadIDand#ofhardwarethreads- Save/restorecoprocessorstate- Query(merproper(es,setup(merinterrupts- Sendinterprocessorinterrupts- SendTLBshootdowns- Reboot/shutdown
§ Simplifieshardwareaccelera(on§ Simplifiesvirtualiza(on
26
27
RISC-V Reference Card ④ Optional Compressed Instructions: RVC
Category Name Fmt RV{32|64|128)I Base Fmt RV mnemonic Fmt RV{F|D|Q} (HP/SP,DP,QP) Category Name Fmt RVCLoads Load Byte I LB rd,rs1,imm R CSRRW rd,csr,rs1 I FL{W,D,Q} rd,rs1,imm Loads Load Word CL C.LW rd′,rs1′,imm
Load Halfword I LH rd,rs1,imm R CSRRS rd,csr,rs1 S FS{W,D,Q} rs1,rs2,imm Load Word SP CI C.LWSP rd,immLoad Word I L{W|D|Q} rd,rs1,imm R CSRRC rd,csr,rs1 R FADD.{S|D|Q} rd,rs1,rs2 Load Double CL C.LD rd′,rs1′,imm
Load Byte Unsigned I LBU rd,rs1,imm R CSRRWI rd,csr,imm R FSUB.{S|D|Q} rd,rs1,rs2 Load Double SP CI C.LWSP rd,immLoad Half Unsigned I L{H|W|D}U rd,rs1,imm R CSRRSI rd,csr,imm R FMUL.{S|D|Q} rd,rs1,rs2 Load Quad CL C.LQ rd′,rs1′,imm
Stores Store Byte S SB rs1,rs2,imm R CSRRCI rd,csr,imm R FDIV.{S|D|Q} rd,rs1,rs2 Load Quad SP CI C.LQSP rd,immStore Halfword S SH rs1,rs2,imm Change Level Env. Call R ECALL R FSQRT.{S|D|Q} rd,rs1 Load Byte Unsigned CL C.LBU rd′,rs1′,imm
Store Word S S{W|D|Q} rs1,rs2,imm R EBREAK R FMADD.{S|D|Q} rd,rs1,rs2,rs3 Float Load Word CL C.FLW rd′,rs1′,immShifts Shift Left R SLL{|W|D} rd,rs1,rs2 R ERET R FMSUB.{S|D|Q} rd,rs1,rs2,rs3 Float Load Double CL C.FLD rd′,rs1′,imm
Shift Left Immediate I SLLI{|W|D} rd,rs1,shamt R MRTS R FMNSUB.{S|D|Q} rd,rs1,rs2,rs3 Float Load Word SP CI C.FLWSP rd,imm Shift Right R SRL{|W|D} rd,rs1,rs2 R MRTH R FMNADD.{S|D|Q} rd,rs1,rs2,rs3 Float Load Double SP CI C.FLDSP rd,imm
Shift Right Immediate I SRLI{|W|D} rd,rs1,shamt R HRTS R FSGNJ.{S|D|Q} rd,rs1,rs2 Stores Store Word CS C.SW rs1′,rs2′,imm Shift Right Arithmetic R SRA{|W|D} rd,rs1,rs2 Interrupt Wait for Interrupt R WFI R FSGNJN.{S|D|Q} rd,rs1,rs2 Store Word SP CSS C.SWSP rs2,imm Shift Right Arith Imm I SRAI{|W|D} rd,rs1,shamt MMU Supervisor FENCE R SFENCE.VM rs1 R FSGNJX.{S|D|Q} rd,rs1,rs2 Store Double CS C.SD rs1′,rs2′,imm
Arithmetic ADD R ADD{|W|D} rd,rs1,rs2 R FMIN.{S|D|Q} rd,rs1,rs2 Store Double SP CSS C.SDSP rs2,imm ADD Immediate I ADDI{|W|D} rd,rs1,imm Category Name Fmt R FMAX.{S|D|Q} rd,rs1,rs2 Store Quad CS C.SQ rs1′,rs2′,imm
SUBtract R SUB{|W|D} rd,rs1,rs2 Multiply MULtiply R R FEQ.{S|D|Q} rd,rs1,rs2 Store Quad SP CSS C.SQSP rs2,imm Load Upper Imm U LUI rd,imm MULtiply upper Half R R FLT.{S|D|Q} rd,rs1,rs2 Float Store Word CSS C.FSW rd′,rs1′,imm
Add Upper Imm to PC U AUIPC rd,imm MULtiply Half Sign/Uns R R FLE.{S|D|Q} rd,rs1,rs2 Float Store Double CSS C.FSD rd′,rs1′,immLogical XOR R XOR rd,rs1,rs2 MULtiply upper Half Uns R R FCLASS.{S|D|Q} rd,rs1 Float Store Word SP CSS C.FSWSP rd,imm
XOR Immediate I XORI rd,rs1,imm Divide DIVide R R FMV.S.X rd,rs1 Float Store Double SP CSS C.FSDSP rd,immOR R OR rd,rs1,rs2 DIVide Unsigned R R FMV.X.S rd,rs1 Arithmetic ADD CR C.ADD rd,rs1
OR Immediate I ORI rd,rs1,imm RemainderREMainder R R FCVT.{S|D|Q}.W rd,rs1 ADD Word CR C.ADDW rd',rs2'AND R AND rd,rs1,rs2 REMainder Unsigned R R FCVT.{S|D|Q}.WU rd,rs1 ADD Immediate CI C.ADDI rd,imm
AND Immediate I ANDI rd,rs1,imm R FCVT.W.{S|D|Q} rd,rs1 ADD Word Imm CI C.ADDIW rd,immCompare Set < R SLT rd,rs1,rs2 Category Name Fmt R FCVT.WU.{S|D|Q} rd,rs1 ADD SP Imm * 16 CI C.ADDI16SP x0,imm
Set < Immediate I SLTI rd,rs1,imm Load Load Reserved R LR.{W|D|Q} rd,rs1 R FRCSR rd ADD SP Imm * 4 CIW C.ADDI4SPN rd',imm Set < Unsigned R SLTU rd,rs1,rs2 Store Store Conditional R SC.{W|D|Q} rd,rs1,rs2 R FRRM rd Load Immediate CI C.LI rd,imm
Set < Imm Unsigned I SLTIU rd,rs1,imm Swap SWAP R AMOSWAP.{W|D|Q} rd,rs1,rs2 R FRFLAGS rd Load Upper Imm CI C.LUI rd,imm Branches Branch = SB BEQ rs1,rs2,imm Add ADD R AMOADD.{W|D|Q} rd,rs1,rs2 R FSCSR rd,rs1 MoVe CR C.MV rd,rs1
Branch ≠ SB BNE rs1,rs2,imm Logical XOR R AMOXOR.{W|D|Q} rd,rs1,rs2 R FSRM rd,rs1 SUB CR C.SUB rd',rs2' Branch < SB BLT rs1,rs2,imm AND R AMOAND.{W|D|Q} rd,rs1,rs2 R FSFLAGS rd,rs1 SUB Word CR C.SUBW rd',rs2' Branch ≥ SB BGE rs1,rs2,imm OR R AMOOR.{W|D|Q} rd,rs1,rs2 I FSRMI rd,imm Logical XOR CS C.XOR rd',rs2'
Branch < Unsigned SB BLTU rs1,rs2,imm Min/Max MINimum R AMOMIN.{W|D|Q} rd,rs1,rs2 I FSFLAGSI rd,imm OR CS C.OR rd',rs2'
Branch ≥ Unsigned SB BGEU rs1,rs2,imm MAXimum R AMOMAX.{W|D|Q} rd,rs1,rs2 AND CS C.AND rd',rs2'Jump & Link J&L UJ JAL rd,imm MINimum Unsigned R AMOMINU.{W|D|Q} rd,rs1,rs2 Category Name Fmt RV{F|D|Q} (HP/SP,DP,QP) AND Immediate CB C.ANDI rd',rs2'
Jump & Link Register I JALR rd,rs1,imm MAXimum Unsigned R AMOMAXU.{W|D|Q} rd,rs1,rs2 R FMV.{D|Q}.X rd,rs1 Shifts Shift Left Imm CI C.SLLI rd,immSynch Synch thread I FENCE R FMV.X.{D|Q} rd,rs1 Shift Right Immediate CB C.SRLI rd',imm
Synch Instr & Data I FENCE.I R FCVT.{S|D|Q}.{L|T} rd,rs1 Shift Right Arith Imm CB C.SRAI rd',immSystem System CALL I SCALL R FCVT.{S|D|Q}.{L|T}U rd,rs1 Branches Branch=0 CB C.BEQZ rs1′,imm
System BREAK I SBREAK 16-bit (RVC) and 32-bit Instruction Formats R FCVT.{L|T}.{S|D|Q} rd,rs1 Branch≠0 CB C.BNEZ rs1′,immCounters ReaD CYCLE I RDCYCLE rd R FCVT.{L|T}U.{S|D|Q} rd,rs1 Jump Jump CJ C.J imm ReaD CYCLE upper Half I RDCYCLEH rd CI Jump Register CR C.JR rd,rs1
ReaD TIME I RDTIME rd CSS R Jump & Link J&L CJ C.JAL imm ReaD TIME upper Half I RDTIMEH rd CIW I Jump & Link Register CR C.JALR rs1 ReaD INSTR RETired I RDINSTRET rd CL S System Env. BREAK CI C.EBREAK
ReaD INSTR upper Half I RDINSTRETH rd CS SBCB UCJ UJ
Category Name
Convert to Int Unsigned
Swap Rounding Mode ImmSwap Flags Imm
3 Optional FP Extensions: RV{64|128}{F|D|Q}
Move Move from IntegerMove to Integer
Convert Convert from IntConvert from Int Unsigned
Convert to Int
Configuration Read StatRead Rounding Mode
Read FlagsSwap Status Reg
Swap Rounding ModeSwap Flags
REMU{|W|D} rd,rs1,rs2 Convert from Int UnsignedOptional Atomic Instruction Extension: RVA Convert to Int
RV{32|64|128}A (Atomic) Convert to Int Unsigned
DIV{|W|D} rd,rs1,rs2 Move Move from IntegerDIVU rd,rs1,rs2 Move to IntegerREM{|W|D} rd,rs1,rs2 Convert Convert from Int
MULH rd,rs1,rs2 Compare Float <MULHSU rd,rs1,rs2 Compare Float ≤MULHU rd,rs1,rs2 Categorize Classify Type
Optional Multiply-Divide Extension: RV32M Min/Max MINimumRV32M (Mult-Div) MAXimum
MUL{|W|D} rd,rs1,rs2 Compare Compare Float =
Redirect Trap to Hypervisor Negative Multiply-ADDHypervisor Trap to Supervisor Sign Inject SiGN source
Negative SiGN sourceXor SiGN source
Environment Breakpoint Mul-Add Multiply-ADDEnvironment Return Multiply-SUBtract
Trap Redirect to Supervisor Negative Multiply-SUBtract
SUBtractAtomic Read & Set Bit Imm MULtiply
Atomic Read & Clear Bit Imm DIVideSQuare RooT
Category NameCSR Access Atomic R/W Load Load Atomic Read & Set Bit Store Store Atomic Read & Clear Bit Arithmetic ADD
Atomic R/W Imm
① ② ③Base Integer Instructions (32|64|128) RV Privileged Instructions (32|64|128) 3 Optional FP Extensions: RV32{F|D|Q}
RV32I / RV64I / RV128I + M, A, F, D, Q, C
+14 Privileged
+ 8 for M
+ 11 for A
+ 34 for F, D, Q + 46 for C
RISC-V Reference Card ④ Optional Compressed Instructions: RVC
Category Name Fmt RV{32|64|128)I Base Fmt RV mnemonic Fmt RV{F|D|Q} (HP/SP,DP,QP) Category Name Fmt RVCLoads Load Byte I LB rd,rs1,imm R CSRRW rd,csr,rs1 I FL{W,D,Q} rd,rs1,imm Loads Load Word CL C.LW rd′,rs1′,imm
Load Halfword I LH rd,rs1,imm R CSRRS rd,csr,rs1 S FS{W,D,Q} rs1,rs2,imm Load Word SP CI C.LWSP rd,immLoad Word I L{W|D|Q} rd,rs1,imm R CSRRC rd,csr,rs1 R FADD.{S|D|Q} rd,rs1,rs2 Load Double CL C.LD rd′,rs1′,imm
Load Byte Unsigned I LBU rd,rs1,imm R CSRRWI rd,csr,imm R FSUB.{S|D|Q} rd,rs1,rs2 Load Double SP CI C.LWSP rd,immLoad Half Unsigned I L{H|W|D}U rd,rs1,imm R CSRRSI rd,csr,imm R FMUL.{S|D|Q} rd,rs1,rs2 Load Quad CL C.LQ rd′,rs1′,imm
Stores Store Byte S SB rs1,rs2,imm R CSRRCI rd,csr,imm R FDIV.{S|D|Q} rd,rs1,rs2 Load Quad SP CI C.LQSP rd,immStore Halfword S SH rs1,rs2,imm Change Level Env. Call R ECALL R FSQRT.{S|D|Q} rd,rs1 Load Byte Unsigned CL C.LBU rd′,rs1′,imm
Store Word S S{W|D|Q} rs1,rs2,imm R EBREAK R FMADD.{S|D|Q} rd,rs1,rs2,rs3 Float Load Word CL C.FLW rd′,rs1′,immShifts Shift Left R SLL{|W|D} rd,rs1,rs2 R ERET R FMSUB.{S|D|Q} rd,rs1,rs2,rs3 Float Load Double CL C.FLD rd′,rs1′,imm
Shift Left Immediate I SLLI{|W|D} rd,rs1,shamt R MRTS R FMNSUB.{S|D|Q} rd,rs1,rs2,rs3 Float Load Word SP CI C.FLWSP rd,imm Shift Right R SRL{|W|D} rd,rs1,rs2 R MRTH R FMNADD.{S|D|Q} rd,rs1,rs2,rs3 Float Load Double SP CI C.FLDSP rd,imm
Shift Right Immediate I SRLI{|W|D} rd,rs1,shamt R HRTS R FSGNJ.{S|D|Q} rd,rs1,rs2 Stores Store Word CS C.SW rs1′,rs2′,imm Shift Right Arithmetic R SRA{|W|D} rd,rs1,rs2 Interrupt Wait for Interrupt R WFI R FSGNJN.{S|D|Q} rd,rs1,rs2 Store Word SP CSS C.SWSP rs2,imm Shift Right Arith Imm I SRAI{|W|D} rd,rs1,shamt MMU Supervisor FENCE R SFENCE.VM rs1 R FSGNJX.{S|D|Q} rd,rs1,rs2 Store Double CS C.SD rs1′,rs2′,imm
Arithmetic ADD R ADD{|W|D} rd,rs1,rs2 R FMIN.{S|D|Q} rd,rs1,rs2 Store Double SP CSS C.SDSP rs2,imm ADD Immediate I ADDI{|W|D} rd,rs1,imm Category Name Fmt R FMAX.{S|D|Q} rd,rs1,rs2 Store Quad CS C.SQ rs1′,rs2′,imm
SUBtract R SUB{|W|D} rd,rs1,rs2 Multiply MULtiply R R FEQ.{S|D|Q} rd,rs1,rs2 Store Quad SP CSS C.SQSP rs2,imm Load Upper Imm U LUI rd,imm MULtiply upper Half R R FLT.{S|D|Q} rd,rs1,rs2 Float Store Word CSS C.FSW rd′,rs1′,imm
Add Upper Imm to PC U AUIPC rd,imm MULtiply Half Sign/Uns R R FLE.{S|D|Q} rd,rs1,rs2 Float Store Double CSS C.FSD rd′,rs1′,immLogical XOR R XOR rd,rs1,rs2 MULtiply upper Half Uns R R FCLASS.{S|D|Q} rd,rs1 Float Store Word SP CSS C.FSWSP rd,imm
XOR Immediate I XORI rd,rs1,imm Divide DIVide R R FMV.S.X rd,rs1 Float Store Double SP CSS C.FSDSP rd,immOR R OR rd,rs1,rs2 DIVide Unsigned R R FMV.X.S rd,rs1 Arithmetic ADD CR C.ADD rd,rs1
OR Immediate I ORI rd,rs1,imm RemainderREMainder R R FCVT.{S|D|Q}.W rd,rs1 ADD Word CR C.ADDW rd',rs2'AND R AND rd,rs1,rs2 REMainder Unsigned R R FCVT.{S|D|Q}.WU rd,rs1 ADD Immediate CI C.ADDI rd,imm
AND Immediate I ANDI rd,rs1,imm R FCVT.W.{S|D|Q} rd,rs1 ADD Word Imm CI C.ADDIW rd,immCompare Set < R SLT rd,rs1,rs2 Category Name Fmt R FCVT.WU.{S|D|Q} rd,rs1 ADD SP Imm * 16 CI C.ADDI16SP x0,imm
Set < Immediate I SLTI rd,rs1,imm Load Load Reserved R LR.{W|D|Q} rd,rs1 R FRCSR rd ADD SP Imm * 4 CIW C.ADDI4SPN rd',imm Set < Unsigned R SLTU rd,rs1,rs2 Store Store Conditional R SC.{W|D|Q} rd,rs1,rs2 R FRRM rd Load Immediate CI C.LI rd,imm
Set < Imm Unsigned I SLTIU rd,rs1,imm Swap SWAP R AMOSWAP.{W|D|Q} rd,rs1,rs2 R FRFLAGS rd Load Upper Imm CI C.LUI rd,imm Branches Branch = SB BEQ rs1,rs2,imm Add ADD R AMOADD.{W|D|Q} rd,rs1,rs2 R FSCSR rd,rs1 MoVe CR C.MV rd,rs1
Branch ≠ SB BNE rs1,rs2,imm Logical XOR R AMOXOR.{W|D|Q} rd,rs1,rs2 R FSRM rd,rs1 SUB CR C.SUB rd',rs2' Branch < SB BLT rs1,rs2,imm AND R AMOAND.{W|D|Q} rd,rs1,rs2 R FSFLAGS rd,rs1 SUB Word CR C.SUBW rd',rs2' Branch ≥ SB BGE rs1,rs2,imm OR R AMOOR.{W|D|Q} rd,rs1,rs2 I FSRMI rd,imm Logical XOR CS C.XOR rd',rs2'
Branch < Unsigned SB BLTU rs1,rs2,imm Min/Max MINimum R AMOMIN.{W|D|Q} rd,rs1,rs2 I FSFLAGSI rd,imm OR CS C.OR rd',rs2'
Branch ≥ Unsigned SB BGEU rs1,rs2,imm MAXimum R AMOMAX.{W|D|Q} rd,rs1,rs2 AND CS C.AND rd',rs2'Jump & Link J&L UJ JAL rd,imm MINimum Unsigned R AMOMINU.{W|D|Q} rd,rs1,rs2 Fmt RV{F|D|Q} (HP/SP,DP,QP) AND Immediate CB C.ANDI rd',rs2'
Jump & Link Register I JALR rd,rs1,imm MAXimum Unsigned R AMOMAXU.{W|D|Q} rd,rs1,rs2 R FMV.{D|Q}.X rd,rs1 Shifts Shift Left Imm CI C.SLLI rd,immSynch Synch thread I FENCE R FMV.X.{D|Q} rd,rs1 Shift Right Immediate CB C.SRLI rd',imm
Synch Instr & Data I FENCE.I R FCVT.{S|D|Q}.{L|T} rd,rs1 Shift Right Arith Imm CB C.SRAI rd',immSystem System CALL I SCALL R FCVT.{S|D|Q}.{L|T}U rd,rs1 Branches Branch=0 CB C.BEQZ rs1′,imm
System BREAK I SBREAK 16-bit (RVC) and 32-bit Instruction Formats R FCVT.{L|T}.{S|D|Q} rd,rs1 Branch≠0 CB C.BNEZ rs1′,immCounters ReaD CYCLE I RDCYCLE rd R FCVT.{L|T}U.{S|D|Q} rd,rs1 Jump Jump CJ C.J imm ReaD CYCLE upper Half I RDCYCLEH rd CI Jump Register CR C.JR rd,rs1
ReaD TIME I RDTIME rd CSS R Jump & Link J&L CJ C.JAL imm ReaD TIME upper Half I RDTIMEH rd CIW I Jump & Link Register CR C.JALR rs1 ReaD INSTR RETired I RDINSTRET rd CL S System Env. BREAK CI C.EBREAK
ReaD INSTR upper Half I RDINSTRETH rd CS SBCB UCJ UJ
Category Name
Category Name
Convert to Int Unsigned
Swap Rounding Mode ImmSwap Flags Imm
3 Optional FP Extensions: RV{64|128}{F|D|Q}
Move Move from IntegerMove to Integer
Convert Convert from IntConvert from Int Unsigned
Convert to Int
Configuration Read StatRead Rounding Mode
Read FlagsSwap Status Reg
Swap Rounding ModeSwap Flags
REMU{|W|D} rd,rs1,rs2 Convert from Int Unsigned
Optional Atomic Instruction Extension: RVA Convert to IntRV{32|64|128}A (Atomic) Convert to Int Unsigned
DIV{|W|D} rd,rs1,rs2 Move Move from IntegerDIVU rd,rs1,rs2 Move to IntegerREM{|W|D} rd,rs1,rs2 Convert Convert from Int
MULH rd,rs1,rs2 Compare Float <MULHSU rd,rs1,rs2 Compare Float ≤MULHU rd,rs1,rs2 Categorize Classify Type
Optional Multiply-Divide Extension: RV32M Min/Max MINimumRV32M (Mult-Div) MAXimum
MUL{|W|D} rd,rs1,rs2 Compare Compare Float =
Redirect Trap to Hypervisor Negative Multiply-ADDHypervisor Trap to Supervisor Sign Inject SiGN source
Negative SiGN sourceXor SiGN source
Environment Breakpoint Mul-Add Multiply-ADDEnvironment Return Multiply-SUBtract
Trap Redirect to Supervisor Negative Multiply-SUBtract
SUBtractAtomic Read & Set Bit Imm MULtiply
Atomic Read & Clear Bit Imm DIVideSQuare RooT
Category NameCSR Access Atomic R/W Load Load Atomic Read & Set Bit Store Store Atomic Read & Clear Bit Arithmetic ADD
Atomic R/W Imm
① ② ③Base Integer Instructions (32|64|128) RV Privileged Instructions (32|64|128) 3 Optional FP Extensions: RV32{F|D|Q}
+ 4 for 64M/128M
28
RV32I / RV64I / RV128I + M, A, F, D, Q, C
+ 12 for 64I/128I
+ 11 for 64A/128A
+ 6 for 64{F|D|Q}/ 128{F|D|Q}
29
RISC-V Reference Card ④ Optional Compressed Instructions: RVC
Category Name Fmt RV{32|64|128)I Base Fmt RV mnemonic Fmt RV{F|D|Q} (HP/SP,DP,QP) Category Name Fmt RVCLoads Load Byte I LB rd,rs1,imm R CSRRW rd,csr,rs1 I FL{W,D,Q} rd,rs1,imm Loads Load Word CL C.LW rd′,rs1′,imm
Load Halfword I LH rd,rs1,imm R CSRRS rd,csr,rs1 S FS{W,D,Q} rs1,rs2,imm Load Word SP CI C.LWSP rd,immLoad Word I L{W|D|Q} rd,rs1,imm R CSRRC rd,csr,rs1 R FADD.{S|D|Q} rd,rs1,rs2 Load Double CL C.LD rd′,rs1′,imm
Load Byte Unsigned I LBU rd,rs1,imm R CSRRWI rd,csr,imm R FSUB.{S|D|Q} rd,rs1,rs2 Load Double SP CI C.LWSP rd,immLoad Half Unsigned I L{H|W|D}U rd,rs1,imm R CSRRSI rd,csr,imm R FMUL.{S|D|Q} rd,rs1,rs2 Load Quad CL C.LQ rd′,rs1′,imm
Stores Store Byte S SB rs1,rs2,imm R CSRRCI rd,csr,imm R FDIV.{S|D|Q} rd,rs1,rs2 Load Quad SP CI C.LQSP rd,immStore Halfword S SH rs1,rs2,imm Change Level Env. Call R ECALL R FSQRT.{S|D|Q} rd,rs1 Load Byte Unsigned CL C.LBU rd′,rs1′,imm
Store Word S S{W|D|Q} rs1,rs2,imm R EBREAK R FMADD.{S|D|Q} rd,rs1,rs2,rs3 Float Load Word CL C.FLW rd′,rs1′,immShifts Shift Left R SLL{|W|D} rd,rs1,rs2 R ERET R FMSUB.{S|D|Q} rd,rs1,rs2,rs3 Float Load Double CL C.FLD rd′,rs1′,imm
Shift Left Immediate I SLLI{|W|D} rd,rs1,shamt R MRTS R FMNSUB.{S|D|Q} rd,rs1,rs2,rs3 Float Load Word SP CI C.FLWSP rd,imm Shift Right R SRL{|W|D} rd,rs1,rs2 R MRTH R FMNADD.{S|D|Q} rd,rs1,rs2,rs3 Float Load Double SP CI C.FLDSP rd,imm
Shift Right Immediate I SRLI{|W|D} rd,rs1,shamt R HRTS R FSGNJ.{S|D|Q} rd,rs1,rs2 Stores Store Word CS C.SW rs1′,rs2′,imm Shift Right Arithmetic R SRA{|W|D} rd,rs1,rs2 Interrupt Wait for Interrupt R WFI R FSGNJN.{S|D|Q} rd,rs1,rs2 Store Word SP CSS C.SWSP rs2,imm Shift Right Arith Imm I SRAI{|W|D} rd,rs1,shamt MMU Supervisor FENCE R SFENCE.VM rs1 R FSGNJX.{S|D|Q} rd,rs1,rs2 Store Double CS C.SD rs1′,rs2′,imm
Arithmetic ADD R ADD{|W|D} rd,rs1,rs2 R FMIN.{S|D|Q} rd,rs1,rs2 Store Double SP CSS C.SDSP rs2,imm ADD Immediate I ADDI{|W|D} rd,rs1,imm Category Name Fmt R FMAX.{S|D|Q} rd,rs1,rs2 Store Quad CS C.SQ rs1′,rs2′,imm
SUBtract R SUB{|W|D} rd,rs1,rs2 Multiply MULtiply R R FEQ.{S|D|Q} rd,rs1,rs2 Store Quad SP CSS C.SQSP rs2,imm Load Upper Imm U LUI rd,imm MULtiply upper Half R R FLT.{S|D|Q} rd,rs1,rs2 Float Store Word CSS C.FSW rd′,rs1′,imm
Add Upper Imm to PC U AUIPC rd,imm MULtiply Half Sign/Uns R R FLE.{S|D|Q} rd,rs1,rs2 Float Store Double CSS C.FSD rd′,rs1′,immLogical XOR R XOR rd,rs1,rs2 MULtiply upper Half Uns R R FCLASS.{S|D|Q} rd,rs1 Float Store Word SP CSS C.FSWSP rd,imm
XOR Immediate I XORI rd,rs1,imm Divide DIVide R R FMV.S.X rd,rs1 Float Store Double SP CSS C.FSDSP rd,immOR R OR rd,rs1,rs2 DIVide Unsigned R R FMV.X.S rd,rs1 Arithmetic ADD CR C.ADD rd,rs1
OR Immediate I ORI rd,rs1,imm RemainderREMainder R R FCVT.{S|D|Q}.W rd,rs1 ADD Word CR C.ADDW rd',rs2'AND R AND rd,rs1,rs2 REMainder Unsigned R R FCVT.{S|D|Q}.WU rd,rs1 ADD Immediate CI C.ADDI rd,imm
AND Immediate I ANDI rd,rs1,imm R FCVT.W.{S|D|Q} rd,rs1 ADD Word Imm CI C.ADDIW rd,immCompare Set < R SLT rd,rs1,rs2 Category Name Fmt R FCVT.WU.{S|D|Q} rd,rs1 ADD SP Imm * 16 CI C.ADDI16SP x0,imm
Set < Immediate I SLTI rd,rs1,imm Load Load Reserved R LR.{W|D|Q} rd,rs1 R FRCSR rd ADD SP Imm * 4 CIW C.ADDI4SPN rd',imm Set < Unsigned R SLTU rd,rs1,rs2 Store Store Conditional R SC.{W|D|Q} rd,rs1,rs2 R FRRM rd Load Immediate CI C.LI rd,imm
Set < Imm Unsigned I SLTIU rd,rs1,imm Swap SWAP R AMOSWAP.{W|D|Q} rd,rs1,rs2 R FRFLAGS rd Load Upper Imm CI C.LUI rd,imm Branches Branch = SB BEQ rs1,rs2,imm Add ADD R AMOADD.{W|D|Q} rd,rs1,rs2 R FSCSR rd,rs1 MoVe CR C.MV rd,rs1
Branch ≠ SB BNE rs1,rs2,imm Logical XOR R AMOXOR.{W|D|Q} rd,rs1,rs2 R FSRM rd,rs1 SUB CR C.SUB rd',rs2' Branch < SB BLT rs1,rs2,imm AND R AMOAND.{W|D|Q} rd,rs1,rs2 R FSFLAGS rd,rs1 SUB Word CR C.SUBW rd',rs2' Branch ≥ SB BGE rs1,rs2,imm OR R AMOOR.{W|D|Q} rd,rs1,rs2 I FSRMI rd,imm Logical XOR CS C.XOR rd',rs2'
Branch < Unsigned SB BLTU rs1,rs2,imm Min/Max MINimum R AMOMIN.{W|D|Q} rd,rs1,rs2 I FSFLAGSI rd,imm OR CS C.OR rd',rs2'
Branch ≥ Unsigned SB BGEU rs1,rs2,imm MAXimum R AMOMAX.{W|D|Q} rd,rs1,rs2 AND CS C.AND rd',rs2'Jump & Link J&L UJ JAL rd,imm MINimum Unsigned R AMOMINU.{W|D|Q} rd,rs1,rs2 Category Name Fmt RV{F|D|Q} (HP/SP,DP,QP) AND Immediate CB C.ANDI rd',rs2'
Jump & Link Register I JALR rd,rs1,imm MAXimum Unsigned R AMOMAXU.{W|D|Q} rd,rs1,rs2 R FMV.{D|Q}.X rd,rs1 Shifts Shift Left Imm CI C.SLLI rd,immSynch Synch thread I FENCE R FMV.X.{D|Q} rd,rs1 Shift Right Immediate CB C.SRLI rd',imm
Synch Instr & Data I FENCE.I R FCVT.{S|D|Q}.{L|T} rd,rs1 Shift Right Arith Imm CB C.SRAI rd',immSystem System CALL I SCALL R FCVT.{S|D|Q}.{L|T}U rd,rs1 Branches Branch=0 CB C.BEQZ rs1′,imm
System BREAK I SBREAK 16-bit (RVC) and 32-bit Instruction Formats R FCVT.{L|T}.{S|D|Q} rd,rs1 Branch≠0 CB C.BNEZ rs1′,immCounters ReaD CYCLE I RDCYCLE rd R FCVT.{L|T}U.{S|D|Q} rd,rs1 Jump Jump CJ C.J imm ReaD CYCLE upper Half I RDCYCLEH rd CI Jump Register CR C.JR rd,rs1
ReaD TIME I RDTIME rd CSS R Jump & Link J&L CJ C.JAL imm ReaD TIME upper Half I RDTIMEH rd CIW I Jump & Link Register CR C.JALR rs1 ReaD INSTR RETired I RDINSTRET rd CL S System Env. BREAK CI C.EBREAK
ReaD INSTR upper Half I RDINSTRETH rd CS SBCB UCJ UJ
Category Name
Convert to Int Unsigned
Swap Rounding Mode ImmSwap Flags Imm
3 Optional FP Extensions: RV{64|128}{F|D|Q}
Move Move from IntegerMove to Integer
Convert Convert from IntConvert from Int Unsigned
Convert to Int
Configuration Read StatRead Rounding Mode
Read FlagsSwap Status Reg
Swap Rounding ModeSwap Flags
REMU{|W|D} rd,rs1,rs2 Convert from Int UnsignedOptional Atomic Instruction Extension: RVA Convert to Int
RV{32|64|128}A (Atomic) Convert to Int Unsigned
DIV{|W|D} rd,rs1,rs2 Move Move from IntegerDIVU rd,rs1,rs2 Move to IntegerREM{|W|D} rd,rs1,rs2 Convert Convert from Int
MULH rd,rs1,rs2 Compare Float <MULHSU rd,rs1,rs2 Compare Float ≤MULHU rd,rs1,rs2 Categorize Classify Type
Optional Multiply-Divide Extension: RV32M Min/Max MINimumRV32M (Mult-Div) MAXimum
MUL{|W|D} rd,rs1,rs2 Compare Compare Float =
Redirect Trap to Hypervisor Negative Multiply-ADDHypervisor Trap to Supervisor Sign Inject SiGN source
Negative SiGN sourceXor SiGN source
Environment Breakpoint Mul-Add Multiply-ADDEnvironment Return Multiply-SUBtract
Trap Redirect to Supervisor Negative Multiply-SUBtract
SUBtractAtomic Read & Set Bit Imm MULtiply
Atomic Read & Clear Bit Imm DIVideSQuare RooT
Category NameCSR Access Atomic R/W Load Load Atomic Read & Set Bit Store Store Atomic Read & Clear Bit Arithmetic ADD
Atomic R/W Imm
① ② ③Base Integer Instructions (32|64|128) RV Privileged Instructions (32|64|128) 3 Optional FP Extensions: RV32{F|D|Q}
RV32I / RV64I / RV128I + M, A, F, D, Q, C RISC-V “Green Card”
SimplicitybreedsContempt
§ HowcansimpleISAcompetewithindustrymonsters?§ HowdomeasureISAquality?- Sta(ccodebytesforprogram- Dynamiccodebytesfetchedforexecu(on- Microarchitecturalworkgeneratedforexecu(on
30
DynamicBytesFetched
31
Fig. 1: The total dynamic instruction count is shown for each of the ISAs, normalized to the x86-64 instruction count. Thex86-64 retired micro-op count is also shown to provide a comparison between x86-64 instructions and the actual operationsrequired to execute said instructions. By leveraging macro-op fusion (in which some common multi-instruction idioms arecombined into a single operation), the “effective” instruction count for RV64GC can be reduced by 5.4%.
Fig. 2: Total dynamic bytes normalized to x86-64. RV64G, ARMv7, and ARMv8 use fixed 4 byte instructions. x86-64 is avariable-length ISA and for SPECInt averages 3.71 bytes / instruction. RV64GC uses two byte forms of the most commoninstructions allowing it to average 3.00 bytes / instruction.
compiler analysis cannot guarantee that the high-order bits arenot zero.403.gcc: 30% of the RISC-V instruction count is taken upby a memset loop. x86-64 utilizes a movdqa instruction(aligned double quad-word move, i.e., a 128-bit store) anda four-way unrolled loop to move 64 bytes in 7 instructionsversus RV64G’s 4 instructions to move 16 bytes.464.h264ref: 25% of the RISC-V instruction count is takenup by a memcpy loop. Those 21 RV64G instructions togetheraccount for 1.1 trillion fetches, compared to a single x86-64“repeat move” instruction that is executed 450 billion times.Remaining benchmarks
Consistent themes of the remaining benchmarks are asfollows:
• RISC-V’s fused compare-and-branch instruction allowsit to execute typical loops using one less instructioncompared to the ARM and x86 ISAs, both of whichseparate out the comparison and the jump-on-conditioninto two distinct instructions.
• Indexed loads are an incredibly common idiom. Al-though x86-64 and ARM implement indexed loads (reg-ister+register addressing mode) as a single instruction,RISC-V requires up to three instructions to emulate thesame behavior.
In summary, when RISC-V is using fewer instructionsrelative to other ISAs, the code likely contains a significantnumber of branches. When RISC-V is using more instructions,it is often due to a significant number of indexed memory
4
• RV64GCislowestoverallindynamicbytesfetched• Despitecurrentlackofsupportforvectoropera(ons
Fig. 1: The total dynamic instruction count is shown for each of the ISAs, normalized to the x86-64 instruction count. Thex86-64 retired micro-op count is also shown to provide a comparison between x86-64 instructions and the actual operationsrequired to execute said instructions. By leveraging macro-op fusion (in which some common multi-instruction idioms arecombined into a single operation), the “effective” instruction count for RV64GC can be reduced by 5.4%.
Fig. 2: Total dynamic bytes normalized to x86-64. RV64G, ARMv7, and ARMv8 use fixed 4 byte instructions. x86-64 is avariable-length ISA and for SPECInt averages 3.71 bytes / instruction. RV64GC uses two byte forms of the most commoninstructions allowing it to average 3.00 bytes / instruction.
compiler analysis cannot guarantee that the high-order bits arenot zero.403.gcc: 30% of the RISC-V instruction count is taken upby a memset loop. x86-64 utilizes a movdqa instruction(aligned double quad-word move, i.e., a 128-bit store) anda four-way unrolled loop to move 64 bytes in 7 instructionsversus RV64G’s 4 instructions to move 16 bytes.464.h264ref: 25% of the RISC-V instruction count is takenup by a memcpy loop. Those 21 RV64G instructions togetheraccount for 1.1 trillion fetches, compared to a single x86-64“repeat move” instruction that is executed 450 billion times.Remaining benchmarks
Consistent themes of the remaining benchmarks are asfollows:
• RISC-V’s fused compare-and-branch instruction allowsit to execute typical loops using one less instructioncompared to the ARM and x86 ISAs, both of whichseparate out the comparison and the jump-on-conditioninto two distinct instructions.
• Indexed loads are an incredibly common idiom. Al-though x86-64 and ARM implement indexed loads (reg-ister+register addressing mode) as a single instruction,RISC-V requires up to three instructions to emulate thesame behavior.
In summary, when RISC-V is using fewer instructionsrelative to other ISAs, the code likely contains a significantnumber of branches. When RISC-V is using more instructions,it is often due to a significant number of indexed memory
4
UC Berkeley Micro:ops%and%Macro:op%Fusion
17
instructions (ISA)
micro-ops (µarch)
rep movs
ld ... st ... add...
Micro:ops%generaIon
cmp
bne
jne
Macro:op%Fusion
Conver,ngInstruc,onstoMicroops
32
Mul(plemicroinstruc(onsfromonemacroinstruc(onOronemicroinstruc(onfrommul(plemacroinstruc(ons
Microopsaremeasureofmicroarchitecturalworkperformed
RISC-VMacro-OpFusionExamples
§ “Loadeffec(veaddressLEA”&(array[offset])slli rd, rs1, {1,2,3}add rd, rd, rs2
§ “indexedload”M[rs1+rs2]add rd, rs1, rs2ld rd, 0(rd)
§ “clearupperword”//rd=rs1&0xffff_ffffslli rd, rs1, 32srli rd, rd, 32
§ Canallbefusedsimplyindecodestage- Manyareexpressiblewith2-bytecompressedinstruc(ons,soeffec(velyjustaddsnew4-byteinstruc(ons
§ RISC-Vapproach:prefermacroopfusiontolargerISA33
RISC-VCompe,,veµarchEffortaBerFusion
34
TABLE V: ARMv8 memory instruction counts. Data is shown for normal loads (ld), loads with increment addressing (ldia),load-pairs (ldp), and load-pairs with increment addressing (ldpia). Data is also shown for the corresponding stores. Many ofthese instructions are likely candidates to be broken up into micro-op sequences when executed on a processor pipeline. Forexample, ldia and ldp require two write ports and the ldpia instruction requires three register write ports.
benchmark % of total ARMv8 instruction countld ldia ldp ldpia st stia stp stpia
400.perlbench 18.18 0.06 3.87 1.02 6.14 1.02 3.81 1.02401.bzip2 22.85 1.71 0.53 0.02 8.28 0.02 0.24 0.02403.gcc 16.80 0.11 2.89 1.04 3.32 1.04 3.03 1.04429.mcf 26.61 0.01 3.21 0.07 3.76 0.07 3.22 0.07
445.gobmk 15.77 1.01 2.04 0.77 6.14 0.74 2.19 0.74456.hmmer 24.20 0.09 0.06 0.02 13.75 0.02 0.01 0.02458.sjeng 17.37 0.00 1.30 0.26 4.38 0.26 1.46 0.26
462.libquantum 14.00 0.00 0.15 0.06 1.85 0.06 0.31 0.06464.h264ref 28.36 0.01 6.61 1.85 3.18 1.82 5.91 1.82471.omnetpp 19.16 0.45 2.56 1.55 8.43 1.54 3.11 1.54
473.astar 24.08 0.01 0.84 0.15 3.73 0.15 0.83 0.15483.xalancbmk 20.94 4.84 1.82 0.68 1.74 0.67 1.51 0.67arithmetic mean 20.69 0.69 2.16 0.62 5.39 0.62 2.14 0.62
Fig. 4: The geometric mean of the instruction counts of thetwelve SPECInt benchmarks is shown for each of the ISAs,normalized to x86-64. The x86-64 micro-op count is reportedfrom the micro-architectural counters on an Intel Xeon proces-sor. The RV64GC macro-op count was collected as describedin Section VI-B. The ARMv8 micro-op count was synthet-ically created by breaking up load-increment-address, load-pair, and load-pair-increment-address into multiple micro-ops.
micro-ops. In particular, ARMv8 supports memory opera-tions with increment addressing modes and load-pair/store-pair instructions. Two write-ports are required for the load-pair instruction (ldp) and for loads with increment addressing(ldia), while three write-ports are required for load-pair withincrement addressing (ldpia). We then modified the QEMUARMv8 ISA simulator to count these instructions that arelikely candidates for generating multiple micro-ops.
Although we show the breakdown of all load and storeinstructions in Table V, we assume for Figure 1 that only ldia,ldp, and ldpia increase the micro-op count for our hypotheticalARMv8 processor. Cracking these instructions into multiple
micro-ops leads to an average increase of 4.09% in theoperation count for ARMv8. As a comparison, the Cortex-A72out-of-order processor is reported to emit “less than 1.1 micro-ops” per instruction and breaks down “move-and-branch” and“load/store-multiple” into multiple micro-ops. [6]
We note that it is possible to “brute-force” these ARMv8instructions and handle them as a single operation within theprocessor backend. Many ARMv8 integer instructions requirethree read ports, so it is likely that most (if not all) ARMv8cores will pay the area overhead of a third read port for thecomplex store instructions. Likewise, they can pay the cost toadd a second (or even third) write port to natively support theload-pair and increment addressing modes. Of course, thereis nothing that prevents a RISC-V core from taking on thiscomplexity, adding the additional register ports, and usingmacro-op fusion to emulate the same complex idioms thatARMv8 has chosen to declare at the ISA level.
VIII. RECOMMENDATIONS
A number of lessons can be learned from analyzing RISC-V’s performance on SPECInt.
A. Programmers
Although it is not legal to modify SPEC for benchmarking,an analysis of its hot loops highlight a few coding idioms thatcan hurt performance on RISC-V (and often other) platforms.
• Avoid unsigned 32-bit integers for array indices. Thesize_t type should be used for array indexing and loopcounting.
• Avoid multi-dimensional arrays if the sizes are knownand fixed. Each additional dimension in the array is anextra level of indirection in C, which is another load frommemory.
• C standard aliasing rules can prevent the compiler frommaking optimizations that are otherwise “obvious” to theprogrammer. For example, you may need to manually‘lift’ code out of a loop that returns the same value everyiteration.
9
[DetailsinUCB2016TRand4thRISC-VworkshoptalkbyChrisCelio]
UCBerkeleyRISC-VCoreGenerators
§ Rocket:FamilyofIn-orderCores- Supports32-bitand64-bitsingle-issueonly- Dual-issuesoon- SimilarinspirittoARMCortexM-seriesandA5/A7/A53
§ BOOM:FamilyofOut-of-OrderCores- Supports64-bitsingle-,dual-,quad-issue- SimilarinspirittoARMCortexA9/A15/A57
35
RISC-VRocketIn-OrderCore
- 64-bit5-stagesingle-issuein-orderpipeline- Designminimizesimpactoflongclock-to-outputdelaysofcompiler-generatedRAMs
- 64-entryBTB,256-entryBHT,2-entryRAS- MMUsupportspage-basedvirtualmemory- IEEE754-2008-compliantFPU- SupportsSP,DPfusedmul(ply-addswithhardwaresupportforsubnormals
- Currentlyworkingondual-issuein-orderRocket
PCGen. I$
Access
ITLBInt.EX D$
Access
DTLBCommit
FP.RF FP.EX1 FP.EX2 FP.EX3
Inst. Decode
Int.RF
VI$Access
VITLBVInst.
DecodeSeq-
uencer ExpandPCGen.
Bank1R/W
Bank8R/W
...
Rocket Pipeline
to Hwacha
from Rocket Hwacha Pipeline
bypass paths omittedfor simplicity
PC IF ID EX MEM WBTo RoCCCoprocessor
36
ARMCortex-A5vs.RISC-VRocket
Category ARMCortex-A5 RISC-VRocket
ISA 32-bitARMv7 64-bitRISC-Vv2
Architecture Single-IssueIn-Order Single-IssueIn-Order5-stage
Performance 1.57DMIPS/MHz 1.72DMIPS/MHz
Process TSMC40GPLUS TSMC40GPLUS
Areaw/oCaches 0.27mm2 0.14mm2
Areawith16KCaches 0.53mm2 0.39mm2
AreaEfficiency 2.96DMIPS/MHz/mm2 4.41DMIPS/MHz/mm2
Frequency >1GHz >1GHz
DynamicPower <0.08mW/MHz 0.034mW/MHz
- PPArepor(ngcondi(ons- 85%u(liza(on,useDhrystoneforbenchmark,frequency/poweratTT0.9V25C,allregularVTtransistors
- 10%higherinDMIPS/MHz,49%morearea-efficient37
RISC-VBOOMOut-of-OrderCore
§ BaselineDesign- Superscalar(parameterizablewidths)- Fullbranchspecula(on(BTB/BHT/RAS)- Load/storequeuewithstoreordering- LoadsexecutefullyOoOwrtstores,otherloads- Store-dataforwardstoloads
§ Es(mated2GHz+in45nm(<30FO4)
38
ARMCortex-A9vs.RISC-VBOOM
Category ARMCortex-A9 RISC-VBOOM-2w
ISA 32-bitARMv7 64-bitRISC-Vv2(RV64G)
Architecture 2wide,3+1issueOut-of-Order8-stage
2wide,3issueOut-of-Order6-stage
Performance 3.59CoreMarks/MHz 3.91CoreMarks/MHz
Process TSMC40GPLUS TSMC40GPLUS
Areawith32Kcaches 2.5mm2 1.00mm2
Areaefficiency 1.4CoreMarks/MHz/mm2 3.9CoreMarks/MHz/mm2
Frequency 1.4GHz 1.5GHz
Caveats:A9includesNEONBOOMis64-bit,hasIEEE-2008fusedmul-add 39
CoreMarkScores
0
1.00
2.00
3.00
4.00
5.00
6.00CoreMark/MHz
CoreM
ark/MHz
UC Berkeley Industry*Comparisons
2
BOOM
$4w
BOOM
$2w
Rocket
in$orderprocessors
out$of$orderprocessors
Ivy9Brid
ge
Cortex
$A15
Cortex
$A9
Cortex
$A8
Cortex
$A5MIPS74
k
SeeChrisCelio’s“BOOM:BerkeleyOut-of-OrderMachine”talkfrom2ndRISC-VWorkshopformoredetails 40
RocketChipGenerator
41
RocketTile
Rocket L1I$
L1D$
RoCCAccel.
CSRFile
L1 Network
L2$ Bank
RocketTile
Rocket L1I$
L1D$
RoCCAccel.CSR
File
JTAGDebug
L2 Network
Cache-CoherentDevice
L2$ BankL2$ Bank
L2$ BankCache-CoherentDevice
TileLink/AXI4Bridge
TileLink/AXI4Bridge
AXI4 Crossbar
DRAMController
DRAMController
High-Speed
IO Device
High-Speed IO
Device
AXI4/AHBBridge
AHB-Lite Bus
Z-scale
Low-Speed
IO DeviceScratch
PadSCRFile
AHB/APBBridge
APB Bus
Peripheral Peripheral Peripheral
1. Change Parameters 2. Develop New Accelerators 3. Develop Own RISC-V Core 4. Develop Own Device
42
Raven-1 Raven-2
Raven-3
Raven-4
EOS14
EOS16
EOS18
EOS20
2011 2012 2013 2014 2015
May Apr Aug Feb Jul Sep Mar Nov Mar
EOS: IBM 45nm SOI Raven: ST 28nm FDSOI Hurricane: ST 28nm FDSOI SWERVE: TSMC 28nm
SWERVE
EOS22 EOS24
UCBerkeleyRISC-VCores:Six28nm&Six45nmRISC-VChipsTapeoutsSoFar
AllbasedonRocketin-ordercore
1GHz
1.65GHz
Hurricane-1
2016
Hurricane-2
43
• Open-Source RTL • Arduino-Compatible • Freedom E SDK
• Arduino IDE Environment
• Available for sale now! • $59
https://www.crowdsupply.com/sifive/hifive1
44
RISC-V is GREAT at Perf and Power Microcontroller CPUCore CPUISA CPU
SpeedDMIPs/MHz Total
DhrystonesDMIPs/mW
IntelCurieModule
IntelQuarkSE
x86 32MHz 1.3 41.6 0.35
ATmega328P AVR AVR(8-bit) 16MHz 0.30 5 0.10
ATSAMD21G18 ARMCortexM0+
ARMv6-M
48MHz 0.93 44.64
NordicNRF51 ARMCortexM0+
ARMv6-M 16MHz 0.93 14.88 1.88
FreedomE310 SiFiveE31 RISC-VRV32IMAC
200MHz320MHz(max)
1.61 320.39 3.16
©2016SiFive.AllRightsReserved.
• 10x Faster Clock than Intel’s Arduino 101 uController • 11x More Dhrystones than ARM’s Arduino Zero (ATSAMD21G18) • 9x More Power Efficient than Intel Quark • 2x More Power Efficient than ARM Cortex M0+
RISC-VOutsideBerkeley
§ Adoptedas“standardISA”forIndia- IIT-Madras$90Mfundingtobuild6differentopen-sourceRISC-Vcores,frommicrocontrollerstoservers
- C-DAC$45Mfundingtobuild2GHzquad-core§ NVIDIAselectedRISC-Vforon-chipmicrocontrollers§ LowRISCprojectbasedinCambridge,UKproducingopen-sourceRISC-VRocket-basedSoCs- LedbyRaspberryPico-founder,privatelyfunded
§ ManycompaniesdevelopingRISC-Vcoresforusein(ghtlyintegratedhardware/sogwareIPblocks
§ FirstcommercialRISC-Vcoreshavealreadyshipped!- RumbleDevelopmentCorp,fordentalcamera/imaging
§ Mul(plecommercialsiliconimplementa(onsshouldbeforsalelaterthisyear 45
RISC-VFounda,on
§ Missionstatement- “tostandardize,protect,andpromotethefreeandopenRISC-Vinstruc(onsetarchitectureanditshardwareandsogwareecosystemforuseinallcompu(ngdevices.”
§ Establishedasa501(c)(6)non-profitcorpora(ononAugust3,2015
§ RickO’ConnorrecruitedasExecu(veDirector§ Firstyear,41+“founding”members.Addi(onalmemberswelcome
46
WhyuseafreeandopenISAlikeRISC-V?
ModestRISC-VProjectGoalBecometheindustry-standardISAforallcompuKng
devices
ModestRISC-VProjectGoalBecometheindustry-standardISAforallcompuKng
devices
48
Would you like to: Or are you happy with: Pick ISA then pick vendor Pick vendor, use their ISA Get competitive bid for 2nd gen. core Vendor lock in Enable open software stack Binary blobs / software NDAs Build your own core configuration Buying a vendor configuration Sharing your core designs No community for core designs Share spec and verification Trusting vendors’ verification Teach class with real core design Using crippled cores in teaching Resell IP with controller core inside Ask customers to license controller Be assured support for 50+ years Trusting vendor business decisions
RISC-VResearchProjectSponsors
▪ DoEIsisProject▪ DARPAPERFECTprogram▪ DARPAPOEMprogram(Siphotonics)▪ STARnetCenterforFutureArchitectures(C-FAR)▪ LawrenceBerkeleyNa(onalLaboratory▪ Industrialsponsors(ParLab+ASPIRE)
- Intel,Google,HPE,Huawei,LG,NEC,Microsog,Nokia,NVIDIA,Oracle,Samsung
Ques,ons?
49